SJTU Summer Short Semester
Due 7/25 at the beginning of class
You are to write an original research paper that uses the IPUMS website to carry out a comparative study of time trends and age patterns of the demographic and socioeconomic characteristics by education, income, ethnicity, race, region, sex, or some other variables. The emphasis is on comparison. If you are interested in a particular ethnicity, for example, you still need to compare it to other ethnicities or the population as a whole to establish what is distinct about it.
Please read the following directions carefully. Since you have nearly two months to complete the project, there is no excuse for not complying with the instructions.
Your research paper should be 2000 words of text (roughly 4 single-spaced pages or 8 double-spaced pages) and 6 tables based on computations at the IPUMS site. The paper should be organized as the text, followed by the references, followed by the tables, with each table on a separate page. All tables should be publication quality according to the specifications below, not simply copied and pasted from the website. Do not insert tables into the main text. Please number all pages, and make sure that your name is on the first page.
The text should consist of four sections: Introduction, Background, Results, and Conclusion. Below I suggest guidelines for the lengths of each of these sections. These guidelines are not rigid, and depending on your topic and your findings the actual word count may differ. You may end up with more or fewer words in each section than suggested.
The Introduction should explain the overall focus of the paper and specify the questions that you are interested in. 250 words should be adequate.
The Background section provides whatever information from other published sources you think may be necessary to help a reader understand the object of your study. For example, if your tables focus on comparison of different ethnic groups, you might provide a brief history of each group in the United States that focuses on features relevant to the analysis. If you are comparing several major cities, you might want to mention key features of each that are relevant to your analyses. 500 words should be sufficient.
The Results section discusses the tables one by one and interprets their contents in light of the hypotheses or theories in the Introduction. The tables should be numbered consecutively, and referred to in the text as Table 1, Table 2, etc.
The Conclusion reviews the most interesting results in the paper and suggests further work. 250 words should be sufficient.
Each of the tables should examine relationships among a distinct set of variables. In other words, the tables should not be repetitions of the same basic tabulation but with different filters. At least two tables should make use of demographic or other variables unique to the American Community Survey (ACS) data, which are annual starting in 2001. At least two tables should make use of variables from the Decennial Census data.
You may also use the Current Population Survey (CPS) data at the IPUMS site. It tends to have much richer detail on labor force and employment characteristics. It may also be harder to use.
For some of your tables, you may also use General Social Survey (GSS) data, which is available at a different website (http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10). It can be analyzed via a web interface like the one that you are already familiar with at IPUMS. The GSS includes questions on topics like religion, political views, and so forth that are not covered in the Census. Keep in mind that if you want to use the GSS, the tables you create should have something to do with demographic behavior, broadly defined.
Each table should also have a self-explanatory title, and the row and column headings should be sufficient to allow a reader to interpret the table without referring to your text. Each table should include a totals column and/or totals row as appropriate. Please format the tables so that there are no vertical lines, and only four horizontal lines: one between the title and the column headings, one between the column headings and the table contents, one between the table contents and the totals row, and one at the bottom. Basically the table should be formatted like the ones you see in the papers in the assigned reading. You will notice that in publications, tables almost never have vertical lines, and generally have a limited number of horizontal lines.
Either the title of the table or a note at the bottom of the table should specify any restrictions that were applied in selecting observations to be included in the calculation. Typically this means specifying the ages and the years that were included in the calculation.
The tables should not be copied and pasted directly from the site, but rather should be prepared to look like they were publication quality, following the guidelines above.
The tables may be frequencies or cross-tabulations like the ones you are already used to. You are also encouraged to take advantage of some of the other tools available at the site. You are most likely to find the comparison of means tool (https://sda.usa.ipums.org/helpfiles/helpan.htm#means) the most useful. This allows you to calculate the mean of one variable for different combinations of other variables. For example, you could calculate mean income (INCTOT) for different combinations of RACE and YEAR. If you are more adventurous, you may try using the correlation or regression tools, but these can take a long time.
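To make concrete what a comparison of means computes, here is a minimal Python sketch. The records are invented for illustration and are not IPUMS data, and the function name is hypothetical. The IPUMS site carries out this calculation for you, so no programming is required for the assignment. The sketch also ignores sample weights, which the online tool may apply.

```python
from collections import defaultdict

# Hypothetical person records (made-up values, not real IPUMS data).
records = [
    {"YEAR": 1990, "RACE": "White", "INCTOT": 20000},
    {"YEAR": 1990, "RACE": "White", "INCTOT": 30000},
    {"YEAR": 1990, "RACE": "Black", "INCTOT": 18000},
    {"YEAR": 2000, "RACE": "White", "INCTOT": 40000},
    {"YEAR": 2000, "RACE": "Black", "INCTOT": 25000},
    {"YEAR": 2000, "RACE": "Black", "INCTOT": 35000},
]


def mean_by_cell(rows, dependent, row_var, col_var):
    """Return {(row value, column value): unweighted mean of the dependent variable}."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_var], r[col_var])].append(r[dependent])
    return {cell: sum(values) / len(values) for cell, values in cells.items()}


# Mean INCTOT for each combination of RACE (rows) and YEAR (columns).
means = mean_by_cell(records, "INCTOT", "RACE", "YEAR")
for (race, year), mean_income in sorted(means.items()):
    print(race, year, mean_income)
```

Each printed line corresponds to one cell of the table you would build: the dependent variable is averaged within every combination of the row and column variables.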
Filter variables to restrict the observations included in the analysis
In constructing your tables, make sure to select or filter observations correctly to make sure the ones you include are relevant. You can restrict the valid range of a variable used in the analysis to achieve the same effect as a filter: https://sda.usa.ipums.org/helpfiles/helpan.htm#range
Depending on the analysis that you are doing, you may want to use a filter to restrict to people of particular ages, or people with particular characteristics. For example, when looking at completed education, EDUC, you will almost always want to restrict to people aged 25 or over, so you will only be looking at people who have completed their education. Similarly, most of the income and occupation variables are only relevant for people of working ages, 18-55. For details on using the selection filter at IPUMS, please see https://sda.usa.ipums.org/helpfiles/helpan.htm#filter
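The logic of an age filter can be sketched in a few lines of Python. The records and the helper name age_filter are hypothetical and serve only to show which observations survive each restriction. On the IPUMS site you would type the restriction into the selection filter box instead.

```python
# Hypothetical records; the AGE and EDUC values are invented for illustration.
people = [
    {"AGE": 17, "EDUC": "Grade 11"},
    {"AGE": 22, "EDUC": "Some college"},
    {"AGE": 25, "EDUC": "Bachelor's degree"},
    {"AGE": 40, "EDUC": "High school"},
    {"AGE": 55, "EDUC": "High school"},
    {"AGE": 70, "EDUC": "Grade 8"},
]


def age_filter(rows, low, high=None):
    """Keep rows with low <= AGE <= high; no upper bound when high is None."""
    return [r for r in rows if r["AGE"] >= low and (high is None or r["AGE"] <= high)]


# Completed education: restrict to people aged 25 or over.
completed_education = age_filter(people, 25)

# Income and occupation: restrict to working ages, 18-55.
working_ages = age_filter(people, 18, 55)

print([r["AGE"] for r in completed_education])
print([r["AGE"] for r in working_ages])
```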
Recode continuous variables like income, age etc. into a manageable number of categories
When constructing tables that are tabulations, you will also want to use recode for any variable that is continuous (a quantity), not discrete (a category). Examples include age, year of birth, and almost any of the income variables. If you are working with age, instead of having a separate row or column for each single year of age (1, 2, 3, etc.) you will want to have a limited number of age groups: 1-9, 10-19 and so on. Similarly, if you want to use total income (INCTOT), income from wages (INCWAGE), or other variables that record an amount in dollars, not a category, you will definitely need to recode the original values into categories.
If you attempt to carry out a tabulation in which one of the income variables is a row, column, or control variable, and don’t recode, the tabulation will almost certainly fail, with an error message indicating that there are too many rows or columns. The definition of your income categories will depend on the year that you are looking at. Because of inflation, typical incomes change dramatically over time. See https://sda.usa.ipums.org/helpfiles/helpan.htm#recode on how to carry out a recode.
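What a recode does can be illustrated with a short Python sketch. The category boundaries, labels, and sample incomes are hypothetical choices for illustration. As noted above, appropriate boundaries depend on the year you are studying. The sketch treats 9999998 and 9999999 as missing codes for INCTOT, and dropping negative amounts is likewise just a simplification made here.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds (in dollars) of hypothetical income categories and their labels.
LOWER_BOUNDS = [0, 10000, 20000, 30000, 40000, 50000]
LABELS = ["0-9999", "10000-19999", "20000-29999",
          "30000-39999", "40000-49999", "50000+"]

# Codes treated as missing or N/A for INCTOT in this sketch.
INCOME_NA_CODES = {9999998, 9999999}


def recode_income(value):
    """Map a dollar amount to a category label; return None if it should be dropped."""
    if value in INCOME_NA_CODES or value < 0:
        return None
    # bisect_right counts how many lower bounds are <= value.
    return LABELS[bisect_right(LOWER_BOUNDS, value) - 1]


# Made-up incomes, including one N/A code that must not land in the top category.
incomes = [0, 9999, 10000, 45500, 75000, 9999999]
labels = [recode_income(v) for v in incomes]
counts = Counter(label for label in labels if label is not None)
print(counts)
```

Thousands of distinct dollar amounts collapse into six categories, which is what makes a tabulation with income as a row or column variable feasible.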
Exclude observations with missing or not available (N/A) values
You will also need to exclude missing or not available (N/A) values, especially if you are computing a mean. In the IPUMS data, when information is missing for a variable in a particular observation, that is typically represented with a numeric value that will be included in any mean that you compute, unless you exclude it. This is especially important for income variables. In total income (INCTOT), missing data is represented by 9999999: https://usa.ipums.org/usa-action/variables/INCTOT/#codes_section. For wage income (INCWAGE), missing is represented as 999999: https://usa.ipums.org/usa-action/variables/INCWAGE/#codes_section. For the socioeconomic index (SEI), N/A is represented as 0: https://usa.ipums.org/usa-action/variables/SEI/#codes_section. And so on. If you fail to exclude the numeric codes for missing values from the calculation of a mean, you may get peculiarly high values (if N/A was being represented as 999999) or peculiarly low values (if N/A was being represented as 0). If you are using other variables, you will need to check the documentation for them to see how missing or N/A was coded, and then exclude those values.
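A small Python sketch shows how badly a single N/A code can distort a mean. The four income values are invented for illustration. Only the code 9999999 for INCTOT comes from the documentation cited above.

```python
# N/A code for INCTOT, as given in the IPUMS documentation cited above.
INCTOT_NA = 9999999

# Made-up INCTOT values: three real incomes and one N/A code.
incomes = [20000, 30000, 40000, 9999999]

# Naive mean: the N/A code is averaged in as if it were a real income.
naive_mean = sum(incomes) / len(incomes)

# Correct mean: drop the N/A code first, then average.
valid = [v for v in incomes if v != INCTOT_NA]
correct_mean = sum(valid) / len(valid)

print(naive_mean)    # about 2.5 million dollars, driven entirely by the N/A code
print(correct_mean)  # 30000.0
```

If a mean in one of your tables looks implausibly large or small, an unfiltered missing-value code is the first thing to check.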
Demographic and Socioeconomic Characteristics to Treat as Outcomes/Dependent Variables
Basic demographic and socioeconomic variables available in most of the decennial Censuses that you might want to consider as outcomes (dependent variables) include but are not limited to:
- Current marital status (MARST)
- Number of children born (CHBORN)
- Age at first marriage (AGEMARR)
- Total individual income (INCTOT)
- Poverty status (POVERTY)
- Educational attainment (EDUC)
- Socioeconomic index (SEI) – this is a commonly used measure of the standing of an individual’s occupation.
- Of course if you have found another variable that you are interested in, you are welcome to use that. Some of you have mentioned school enrollment, home ownership, type of school, health insurance, and so forth.
The ACS also includes a rich set of demographic variables that could be used as outcomes. The ACS are the data that show up annually since 2000 for 2001, 2002, 2003 etc. The most interesting for this class are some variables for very recent years that indicate whether certain events have occurred in the last year. These could be the basis for calculating rates, as opposed to percentages:
- Children born within the last year (FERTYR)
- Married, divorced or widowed within the last year (MARRINYR, DIVINYR, WIDINYR).
These lists are only meant as suggestions, and if you have other interests that can be addressed with other variables you have found, you may pursue them.
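As an illustration of how an "event in the last year" variable turns into a rate, here is a minimal Python sketch using FERTYR. It assumes the coding described in the IPUMS documentation (0 for N/A, 1 for no births in the last year, 2 for one or more births in the last year). The list of codes is made up, and expressing the result per 1,000 women is simply one common convention.

```python
# FERTYR codes assumed here: 0 = N/A, 1 = no birth in the last year, 2 = birth in the last year.
FERTYR_NA = 0
FERTYR_BIRTH = 2

# Made-up FERTYR codes for nine women; one of them is N/A.
fertyr = [1, 1, 2, 1, 0, 1, 1, 1, 1]

# Only women with a valid answer belong in the denominator.
valid = [code for code in fertyr if code != FERTYR_NA]
births = sum(1 for code in valid if code == FERTYR_BIRTH)

# Women with a birth in the last year per 1,000 women with valid data.
rate_per_1000 = 1000 * births / len(valid)
print(rate_per_1000)
```

In practice you would also restrict the calculation to women of childbearing age with a selection filter before computing the rate.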
Demographic and Socioeconomic Characteristics to Treat as Explanatory/Independent Variables
Generally your explanatory variables should precede your outcome variables in time. That doesn’t always mean they have a causal effect on the outcome, but a causal interpretation is at least more plausible. So, for example, you might examine number of children born (CHBORN) for women aged 45 according to their level of education (EDUC), but you probably won’t think about studying the education of women aged 45 according to their number of children. The variables are of course the same in both cases, but the interpretation of which is an outcome and which is explanatory differs.
- Race (RACE) – Note that since 2000, Race includes codes identifying people who have said they were two or more races. There are also codes since 2000 for single races, for example, RACASIAN.
- Hispanic (HISPAN) – Note that Hispanic status is separate from race.
- A variety of other nativity and ancestry variables are available at http://usa.ipums.org/usa-action/variables/group/race_eth. The availability of these variables tends to change over time, so there isn’t really one nativity or ancestry variable that is available on a continuous basis since 1850. I will post a separate guide to using some of the key variables.
- Geographic identifiers in http://usa.ipums.org/usa-action/variables/CITY#codes_section. Note that the IPUMS doesn’t offer any more detail than City, so with IPUMS you can’t compare different neighborhoods in the same city.
- Of course you can use EDUC, INCTOT and other variables as explanatory variables; just make sure that your dependent variable comes after them in time.
Examples of tables you could construct
- Use the comparison of means to look at mean number of children born for people of different races in different years. In this case, you would select number of children as your dependent variable, and RACE and YEAR as row and column variables. You would probably want to filter to restrict to (for example) women who were old enough to have completed their childbearing, say 50 years old. You might want to restrict to decennial census years.
- Use the comparison of means to look at mean income for people of different ages with different levels of education. In this case you would select income as your dependent variable, and age and education as your rows and columns. You would probably want to set a filter to restrict to ages when people might actually have incomes, for example, 25-55. You would want to recode age so that instead of having about thirty rows, one for each age, you have three rows, one for each ten-year age group.
- My posts with IPUMS tips and tricks are accessible via http://camerondcampbell.me/category/ipums/ Make sure to review them to see if there is anything that helps you.
- If you are trying to use an income variable such as INCTOT as a row or column variable, you will need to recode it into a limited number of categories in order for a table to work. If you simply specify INCTOT or another income variable as a row or column variable, the table won’t run, because there are too many distinct values, requiring thousands of columns or rows. You will need to use the recode to regroup incomes into a manageable number of categories, and of course exclude 9999999 and 9999998.
- Most if not all of the income variables, including INCTOT, FINCTOT, and HINCTOT, code missing values or not available as 9999999, 9999998, 999999, 999998, or some variant thereof. INCTOT codes missing values as 9999999: https://usa.ipums.org/usa-action/variables/INCTOT/#codes_section. If you are carrying out a comparison of means, you need to exclude those observations because the average shouldn’t include these values. You could do this by putting inctot(*-9999997) in the filter.
- Similarly, if you are categorizing income, make sure that the highest category of income doesn’t include 9999998 and 9999999. For example, inctot(r:0-9999;10000-19999;20000-29999;30000-39999;40000-49999;50000-9999997)
- Many of the fertility variables use 0 to indicate missing or no response and 1 to indicate no births or no children. For example, the ACS variable FERTYR is 0 for Not Available, 1 for no births in the last year, and 2 for one or more births in the last year: https://usa.ipums.org/usa-action/variables/FERTYR#codes_tab . Similarly, CHBORN is 0 for not available, 1 for no children, 2 for one child, 3 for two children, and so forth: https://usa.ipums.org/usa-action/variables/CHBORN#codes_tab . Be attentive to this when you interpret your results. If you are computing mean number of children, or mean numbers of births, you will often want to subtract one from the numbers you present.
- If you are computing averages of any variables via Comparison of Means, make sure to inspect the detailed documentation for those variables to find out how missing values are coded, and use a selection filter to exclude them.
- Again, use selection filters to make sure that the observations you include are relevant to the question you are interested in. For example, if you want to use SCHOOL to look at whether or not someone is currently enrolled in school, you would want to restrict to people who have a chance of being currently enrolled by applying a selection filter based on age. Restricting to age(14-18), for example, would let you look at people who were eligible to be in high school. If you are looking at completed education, normally you would want to restrict to ages 25 and above.
- Remember that not every variable is available in every year. For the variables you are interested in, check to see which years they are available in. Some very interesting variables are only available in one or two years. The variables related to ethnicity, nativity, and origin are especially prone to change.
- Remember that 2001-2009 are based on the ACS. If you just want to present data from the decennial Census, you would restrict to years 1850-2000, and if you just wanted ACS data, you would restrict to 2001-2009.
- Keep in mind that the ACS has some nice variables that allow for direct computation of certain demographic rates, like whether or not someone has married in the last year, whether or not someone has had a birth in the last year, and so forth.
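The CHBORN coding mentioned in the tips above (0 for not available, 1 for no children, 2 for one child, and so forth) can be handled as in the following Python sketch. The codes listed are made up, and the sketch assumes that "code minus one" holds for every code present. Check the codes page for the years you use, since the highest codes may represent grouped values.

```python
# CHBORN coding assumed here: 0 = N/A, 1 = no children, 2 = one child, 3 = two children, ...
CHBORN_NA = 0

# Made-up CHBORN codes for six women; the first is N/A.
chborn_codes = [0, 1, 2, 3, 3, 4]

# Step 1: exclude N/A so it is not averaged in as a real value.
valid_codes = [code for code in chborn_codes if code != CHBORN_NA]

# Step 2: the mean of the raw codes overstates the number of children by one.
mean_of_codes = sum(valid_codes) / len(valid_codes)

# Step 3: subtract one to get the mean number of children ever born.
mean_children = mean_of_codes - 1

print(mean_of_codes)
print(mean_children)
```

The same two steps, excluding the N/A code and then shifting by one, apply when you take a mean from the comparison of means tool and report it in a table.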