Evaluations from the Summer 2014 CMGPD Workshop at SJTU

We conducted the 4th CMGPD summer workshop at Shanghai Jiaotong University this summer. As usual, we conducted a survey at the end to get student feedback. I’m a fan of making student evaluations public, so I have uploaded the scanned forms via the link below:


Overall, I was pleased with the results of the workshop. We have always had good participants. This year, however, we were fortunate to have an especially large share of participants who were interested in historical topics, and had some facility with quantitative methods. In previous offerings, participants were often one or the other.

The students made presentations on the last day with preliminary results and I think that with some more work, many of them can be turned into papers.

Studies that receive attention in the media

Recently I have started teaching research design classes at the undergraduate and graduate levels. By research design, I mean basic elements of study design and analysis such as translating concepts into measures and theories into hypotheses, sampling, questionnaire design, experimental and quasi-experimental designs, causal inference, and so forth. This has been a new experience for me, and I am still struggling to find a way of turning the class from one in which I am talking at the students to one that revolves around projects that crystallize their understanding of the issues we are covering.

I was very lucky to have taken a really outstanding research design class from Herb Smith when I was studying for the PhD at Penn, but I have no hope of replicating it. I’ve been going through all of my old notes and assignments from that class, and I came to the conclusion that if I made the students do that much work, they would rebel. It’s unfortunate because in retrospect that is one of the most important classes I took in graduate school, in the sense of having a long-term impact on the way that I think.

One thing I am doing now in preparation for my next time round with the research design classes is assembling a list of studies, good and bad, that have received attention in the media.  What I am looking for are studies that have received a lot of media attention, that are clear examples of specific designs, good and bad, and whose strengths and, more commonly, limitations are fairly straightforward. Accordingly, I am avoiding studies where possible critiques revolve around subtle issues related to sampling or questionnaire design. I may develop another list for that.

As various studies come to my attention, I am going to add links to them here, so I can refer students here when I ask them to select a study and assess it. Of course I welcome suggestions. I am not looking for gold standard studies. Rather, I am looking for studies, good and bad, that have received a lot of attention in the media.

In some cases, I am linking to discussions of debates about a study or topic.

Here goes:

  1. Warning labels on antidepressants and teen suicide
  2. Hurricane fatalities according to the gender of the hurricane’s name
  3. Estimating the number of participants in the July 1 march
  4. College educated children and old age mortality. Discussion of the findings at Washington Post, New York Times, Slate.
  5. Marijuana legalization and painkiller abuse. Articles at CNN, Vice.
  6. What kinds of posts does the Chinese government censor? Discussion of the article in Science.
  7. OK Cupid’s controversial experiments on the users of its dating site. An Op-Ed piece from one of the founders.
  8. The strange dispute over whether eating together has positive effects on families, or at least on children.
  9. Outcomes of children raised by gay parents. Another summary of the dispute, and a critique of the study signed by 200 researchers.




Summer 2014 China Multigenerational Panel Dataset Workshop at SJTU (English announcement)

The 4th China Multigenerational Panel Dataset Workshop
Shanghai Jiaotong University, Minhang Campus
Shanghai, China

July 14-25, 2014


The Center for the History and Society of Northeast China at the Shanghai Jiaotong University School of Humanities will hold its 4th summer China Multigenerational Panel Data workshop from July 14 to July 25.

The workshop will focus on introducing the China Multigenerational Panel Datasets (CMGPD) as sources for the study of demography, stratification, and social and family history. These include the China Multigenerational Panel Dataset – Liaoning (CMGPD-LN) and the China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC).  The CMGPD have been released via the Inter-university Consortium for Political and Social Research.  The latest versions of the CMGPD documentation are available for download.

The CMGPD datasets have many unique features that make them useful not only for the study of Chinese population, social, and family history, but for the study of demographic, social and economic processes more generally.  Their features also make them useful as testbeds for researchers developing novel quantitative techniques.  The datasets are longitudinal, multi-generational, and structured at multiple levels, including the individual, the household, the kin group, the community, the administrative unit, and the region.

UCLA Professor of Sociology Cameron Campbell will be the primary lecturer. Guest lecturers will include Distinguished Professor and Dean of Humanities and Social Sciences at the Hong Kong University of Science and Technology James Lee; Yuxue Ren, Professor of History at Shanghai Jiaotong University; and Dong Hao, PhD student at the Hong Kong University of Science and Technology.

This class is intended to 1) introduce researchers to the CMGPD datasets and help them decide whether they may be useful in their own studies, 2) give current users an opportunity to learn more about the origin and context of the data, and 3) give participants basic instruction in the use of STATA to describe, organize and analyze the data.   Researchers who have already started using the CMGPD-SC or CMGPD-LN are welcome to attend and take advantage of the opportunity to discuss any questions they may have with Lee, Campbell, and others who were involved in the creation of the dataset.

Lectures and discussion will focus on 1) the historical, social, economic and institutional context of the populations covered by the data, 2) key features of the data, and 3) potential applications.  There will be optional sessions to introduce the Training Guide and demonstrate basic procedures for downloading the data from the website and loading it into STATA.

Please note that while there will be basic instruction in the use of STATA to organize and analyze the data, this is not intended as a class in STATA, or introductory statistics. Students looking specifically for instruction in STATA, statistics, or data management are encouraged to look elsewhere. Again, the class is intended for participants who want to assess whether CMGPD is suitable for their research interests, or are already considering the use of the CMGPD and seek basic instruction in the use of STATA to manipulate and analyze it.

The workshop will include daily exercises to introduce key features of the data, and STATA techniques for taking advantage of these features. Participants will also complete a small project of their own design using the data and present it on the last day of the workshop.

If any non-Chinese speakers enroll, the lectures will be in English.  If the participants all speak Chinese, lectures may be in Chinese, or a mixture of English and Chinese.  Discussion will be in English and Chinese.

The Shanghai Jiaotong University Center for the History and Society of Northeast China was established as a research unit by a collaboration of the Shanghai Jiaotong University (SJTU) School of the Humanities and the Hong Kong University of Science and Technology (HKUST) School of the Humanities and Social Sciences.


China Multigenerational Panel Dataset – Liaoning (CMGPD-LN)

The CMGPD-LN is an important dataset for the study of China’s family, social and demographic history, and for the study of demography and stratification more generally. The dataset is suitable for application of a wide variety of statistical techniques that are commonly used in social demography for the analysis of longitudinal, individual-level data, and available in the most popular statistical software packages. The dataset is distinguished by its size, temporal depth, and richness of detail on family, household and kinship context.

The materials from which the dataset was constructed are Shengjing Imperial Household Agency household registers held in the Liaoning Provincial Archives. The registers are triennial. Altogether there are 3,600 of them. We transcribed a subset of them to produce the CMGPD-LN, which spans 160 years from 1749 to 1909. At present, the dataset comprises 29 register series, and consists of 1,500,000 records that describe 260,000 individuals over seven generations. The CMGPD-LN is accordingly an important resource for the study of historical demography, sociology, economics, and other fields.

The CMGPD-LN and associated English-language documentation are already available for download at ICPSR.

China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC)

The CMGPD-SC covers communities of recent settlers in Shuangcheng, Heilongjiang in the last half of the nineteenth century and beginning of the twentieth. It contains 1.35 million records that describe 100,000 people. The registers cover descendants of urban migrants from Beijing and rural migrants from neighboring areas in northeast China who came to the area in the first half of the nineteenth century as part of a government organized effort to settle this largely vacant frontier region. One of the distinguishing features of this dataset is the availability of linked, individual-level landholding records for several points in time. The data also include a rich array of other indicators of household and family context and socioeconomic status.

Pending release of the CMGPD-SC through ICPSR, the data are available for download here.


Monday, July 14, 2014 to Friday, July 25, 2014

Shanghai Jiaotong University School of Humanities (SJTU Minhang Campus, Shanghai)

Application deadline
May 1, 2014

See link below to download application

Application procedure

Please send your personal statement, curriculum vitae, and application form (English or 中文) as attachments to chinanortheast@gmail.com.

Applications from faculty, postdoctoral researchers and graduate students are welcome. Applications from graduating college seniors will also be considered if they have already been accepted into a graduate program beginning fall 2014.  In that case, the application should include a copy of their graduate school acceptance. Any other interested parties should contact our staff at chinanortheast@gmail.com before applying to see if they will be considered.

Participants should be able to speak or read Chinese or English.  No prior experience in statistics, demography, or Chinese history is required.  Applicants must explain the reasons for their interest in the data in their application, and should demonstrate that they have background, experience or interests that in some way are relevant.

Participants who are Chinese nationals will have accommodations. Participants who are not Chinese nationals will receive assistance with arranging accommodations, and will receive a housing subsidy to help offset their costs. Participants who want other accommodations will have to arrange them on their own and will be responsible for all associated costs.

Participants should bring their own computer.

Students are responsible for all travel and local expenses, health care expenses, and other incidentals. Participants coming from abroad are strongly encouraged to confirm that their health insurance offers international coverage, or purchase travel health insurance.

Participants who are not Chinese nationals will need to obtain visas. We will issue invitation letters to facilitate the visa application. We strongly urge that accepted participants who need visas begin the application process as soon as possible after they are notified of their acceptance.

At present we expect to be able to accommodate 25-30 participants.


Required Reading

Read the following before the workshop begins.  The highest priority is the specified pages in the CMGPD-LN and CMGPD-SC User Guides.


The documentation below is available here.

  • CMGPD-LN User Guide.  English pages 1-54, 90-96 or Chinese pages 13-64, 96-101.  Skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD-SC User Guide.  English pages 1-47. Again, skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD Training Guide. Pay particular attention to the sections at the beginning that introduce the data and highlight its distinctive characteristics.

Research Articles

  • Campbell, Cameron and James Lee. 2002 (publ. 2006). “State views and local views of population: Linking and comparing genealogies and household registers in Liaoning, 1749-1909.” History and Computing. 14(1+2):9-29.  http://papers.ccpr.ucla.edu/papers/PWP-CCPR-2004-025/PWP-CCPR-2004-025.pdf
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Appendix A.
  • Campbell, Cameron and James Z. Lee. 2011. “Kinship and the Long-Term Persistence of Inequality in Liaoning, China, 1749-2005.” Chinese Sociological Review. 44(1):71-104.  http://www.ncbi.nlm.nih.gov/pubmed/23596557

Review Articles

  • 康文林 (Cameron Campbell).  2012.  “历史人口学 (Historical Demography).”  Chapter 8 in 梁在编 (Zai Liang ed.) 人口学 (Demography).   北京:人民大学出版社 (Beijing: Renmin University Press), 233-265.

Select one or two of the following research articles based on your own interests (or another published article that uses the CMGPD), and read before the workshop starts

  • CHEN Shuang, James Lee, and Cameron Campbell. 2010. “Wealth stratification and reproduction in Northeast China, 1866-1907.” History of the Family. 15:386-412.  http://www.ncbi.nlm.nih.gov/pubmed/21127716
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Chapter 10.
  • Wang Feng, Cameron Campbell, and James Z. Lee. 2010. “Agency, Hierarchies, and Reproduction in Northeastern China, 1789 to 1840.” Chapter 11 in Noriko Tsuya, Wang Feng, George Alter, James Z. Lee et al. Prudence and Pressure: Reproduction and Human Agency in Europe and Asia, 1700-1900. MIT Press, 287-316.
  • Chen Shuang, Cameron Campbell, and James Z. Lee.  Forthcoming.  “Categorical Inequality and Gender Difference: Marriage and Remarriage in Northeast China, 1749-1912.”  Chapter 11 in Lundh, Christer, Satomi Kurosu, et al. Similarity in Difference.


If you are not familiar with STATA, prepare for the workshop by reviewing as many of the materials for learning and using STATA at UCLA IDRE as possible. You are also strongly encouraged to watch video tutorials at the STATA website. Ideally, by the time you arrive at the workshop, you should already be able to  carry out very basic operations in STATA such as loading and saving files, creating tabulations and so forth. Do try to download the CMGPD-SC or CMGPD-LN and make sure you know how to load them and carry out very simple operations.

Recommended Reading

  • As much of the User Guides and Training Guide as you can.
  • 定宜庄, 郭松义, 李中清, 康文林. 2004. 辽东移民中的旗人社会.  上海:上海社会科学出版社.
  • Lee, James and Cameron Campbell. 1997. Fate and Fortune in Rural China: Social Organization and Population Behavior in Liaoning, 1774-1873. Cambridge University Press.
  • 李中清,王丰.  2000.  人类的四分之一: 马尔萨斯的神话与中国的现实:1700-2000。  三联·哈佛燕京学术丛书。(English: Lee, James and Wang Feng.  1999.  One Quarter of Humanity: Malthusian Mythology and Chinese Reality, 1700-2000.)
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.

Tentative Schedule (at Onedrive)


Preparation of the CMGPD-LN and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD057175-01A1 “Multi-Generation Family and Life History Panel Dataset” with funds from the American Recovery and Reinvestment Act.

Preparation of the CMGPD-SC and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD070985-01 “Multi-generational Demographic and Landholding Data: CMGPD-SC Public Release.”

The CMGPD summer workshops in Shanghai have been supported by Shanghai Jiaotong University, the School of Humanities, the Department of History, and the Center for the History and Society of Northeast China.  We are also grateful to staff at a variety of campus units at SJTU for their logistical support.

Student evaluations for SOSC 1860 and SSMA 5010, Fall 2013


I received student evaluations for the two courses that I taught last fall, SOSC 1860 (Population and Society) and SSMA 5010 (Research Methods).

The former is a general education (Common Core in HKUST parlance) course aimed at freshmen and sophomores, while the latter is a required course in our self-taught Social Science MA program. I enjoyed teaching both courses. The students were bright and highly motivated.

Here are the evaluations for SOSC 1860.

I was initially surprised to read that the students in SOSC 1860 thought I required too much work, but eventually concluded that this probably reflects less prior exposure to open-ended written assignments and projects than students I have taught elsewhere. In fact, the course was a simplified version of an upper division course I taught regularly at UCLA that was only ten weeks long (versus thirteen here) yet had even more written assignments and reading. The assignments mostly required them to visit some websites to collect demographic data, and then write about trends and patterns. The final project required them to carry out an analysis at IPUMS. Talking to students here, it seems that they found the relatively open-ended assignments intimidating. The students here are just as smart and motivated as the ones I taught at UCLA, and they actually did a good job on the assignments and their final projects, so I suspect their reaction has more to do with lack of familiarity or confidence with open-ended written assignments than with any actual lack of ability. Several students I talked to said this was the first class they had ever taken that made such heavy use of written assignments. I will probably need to adjust the number of assignments next fall.

The evaluations for SSMA 5010 are unremarkable, and about what I expected. Some of the comments reflect that this was a new prep, and I will have to continue revising my course plan and the lecture slides. This is the first time I have taught a research methods course, and it was fun. The students were highly motivated and engaged, making it a relatively pleasant task.


Evaluations from my summer 2013 short course in Social Demography at SJTU

I received the evaluations from my summer 2013 short course in Social Demography at Shanghai Jiaotong University.  This undergraduate course is an abbreviated version of the one that I have taught at UCLA in the past, and am teaching at HKUST now.  If I understand the scores correctly, I don’t seem to have done too much damage.

Personally, I believe that aggregated information from teaching evaluations should be public, at least to students.  This should also be combined with efforts to maximize response rates.  I liked the system implemented at UCLA where the administration provided a list of students who had completed web-based evaluations in time for the instructor to provide a small amount of credit included in the calculation of the final grade.  Obviously, all the administration provided was a list of names.  It didn’t include the content of the responses.  We only saw the summary report on the evaluations and the collected written comments from students after we turned in grades.

If you are having difficulty viewing the embedded Excel spreadsheet, you can download it here.

You can view other entries where I have posted the class evaluations.

SOSC 1860 W13 Final Project

Due Friday 11/29 at 11:59pm via TurnItIn.

You are to write an original research paper that uses sites such as IPUMS USA, IPUMS International, IPUMS CPS, and the Hong Kong Census and Statistics website to carry out a comparative study of trends and patterns of demographic characteristics or behavior such as marriage, fertility, or migration by other variables such as education, income, ethnicity, race, region, or sex.

Please read the following directions carefully.  Since you have nearly two months to complete the project and ask questions, there is no excuse for not complying with the instructions.

Your research paper should be roughly 2000 words of text (roughly 4 single-spaced pages or 8 double-spaced pages) and 5 tables based on computations at the IPUMS site or at other sites.   The paper should be organized as the text, followed by the references, followed by the tables, with each table on a separate page.  All tables should be publication quality according to the specifications below, not simply copied and pasted from the website.  Do not insert tables into the main text.  Please number all pages, and make sure that your name is on the first page.

The text should consist of four sections: Introduction, Background, Results, and Conclusion.  Below I suggest guidelines for the lengths of each of these sections.  These guidelines are not rigid, and depending on your topic and your findings the actual word count may differ.

The Introduction should explain the overall focus of the paper and explain why you think your topic is interesting.   250 words should be sufficient.

The Background section should provide whatever information from other published sources you think may be necessary to help a reader understand the object of your study.  For example, if your tables focus on comparison of different ethnic groups, you might provide a brief account of each group’s history in the United States that focuses on features relevant to the analysis.  If you are comparing several major cities, you might want to mention key features of each relevant to your analyses.  500 words should be sufficient.

The Results section should discuss the tables one by one, and interpret their contents.  The tables should be numbered consecutively, and referred to in the text as Table 1, Table 2 etc.

The Conclusion reviews the most interesting results in the paper and suggests further work.  250 words should be sufficient.


Each of your tables should examine relationships among a distinct set of variables.  In other words, the tables should not be repetitions of the same basic tabulation but with different filters. 

All of your tables should be ones that you generated yourself at one of the sites I have referred you to. The point of this exercise is to introduce you to data collection and analysis. Tables copied from yearbooks, statistical digests, government publications, or other sources, will not count toward your requirement.

The tables should not be repetitions of ones you have already constructed for a class assignment.

You may also use the Current Population Survey (CPS) data at the IPUMS site.  It tends to have much richer detail on labor force and employment characteristics.

For some of your tables, you may also use General Social Survey (GSS) data, which is available at a different website (http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10).  It can be analyzed via a web interface like the one that you are already familiar with at IPUMS.  The GSS includes questions on topics like religion, political views, and so forth that are not covered in the Census.  Keep in mind that if you want to use the GSS, the tables you create should have something to do with demographic behavior, broadly defined.

If you would like to do some comparison with Hong Kong, you may produce up to two of your tables by analyzing data at the Hong Kong Census and Statistics website. The remaining tables must be from IPUMS, IPUMS International, IPUMS CPS, or GSS.

Each table should also have a self-explanatory title, and the row and column headings should be sufficient to allow a reader to interpret the table without referring to your text.  Each table should include a totals column and/or totals row as appropriate.  Please format the tables so that there are no vertical lines, and only four horizontal lines: one between the title and the column headings, one between the column headings and the table contents, one between the table contents and the totals row, and one at the bottom.  Basically the table should be formatted like the ones you see in the papers in the assigned reading.  You will notice that in publications, tables almost never have vertical lines, and generally have a limited number of horizontal lines.

Either the title of the table or a note at the bottom of the table should specify any restrictions that were applied in selecting observations to be included in the calculation.  Typically this means specifying the ages that were included in the calculation, and the years.

The tables should not be copied and pasted directly from the websites, but rather should be prepared to look like they were publication quality, following the guidelines above.

The tables may be frequencies or cross-tabulations like the ones you are already used to.  You are also encouraged to take advantage of some of the other tools available at the site.  You are most likely to find the comparison of means tool (https://sda.usa.ipums.org/helpfiles/helpan.htm#means) the most useful.   This allows you to calculate the mean of one variable for different combinations of other variables.  For example, you could calculate mean income (INCTOT) for different combinations of RACE and YEAR.  If you are more adventurous, you may try using the correlation or regression tools, but these can take a long time.
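For those who find it easier to see the logic spelled out, here is a rough Python sketch of what a comparison of means computes. The website does all of this for you; the records below are made up for illustration, and the function name `mean_by` is my own, not anything from the IPUMS site. The sketch also ignores sampling weights, which the online tool may apply.

```python
from collections import defaultdict

# Made-up records for illustration only; these are not real IPUMS values.
records = [
    {"YEAR": 1990, "RACE": "White", "INCTOT": 30000},
    {"YEAR": 1990, "RACE": "White", "INCTOT": 20000},
    {"YEAR": 1990, "RACE": "Black", "INCTOT": 18000},
    {"YEAR": 2000, "RACE": "White", "INCTOT": 40000},
    {"YEAR": 2000, "RACE": "Black", "INCTOT": 26000},
    {"YEAR": 2000, "RACE": "Black", "INCTOT": 30000},
]

def mean_by(rows, outcome, row_var, col_var):
    """Return {(row value, column value): unweighted mean of outcome}."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for r in rows:
        cell = totals[(r[row_var], r[col_var])]
        cell[0] += r[outcome]
        cell[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Mean INCTOT for each combination of RACE (rows) and YEAR (columns).
means = mean_by(records, "INCTOT", "RACE", "YEAR")
for (race, year), m in sorted(means.items()):
    print(race, year, round(m))
```

Each cell of the resulting table is simply the average of the outcome among the records that share that row and column value.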

In constructing your tables, make sure to select or filter observations correctly to make sure the ones you include are relevant.  You can restrict the valid range of a variable used in the analysis to achieve the same effect as a filter: https://sda.usa.ipums.org/helpfiles/helpan.htm#range

Depending on the analysis that you are doing, you may want to use a filter to restrict to people of particular ages, or people with particular characteristics.  For example, when looking at completed education, EDUC, you will almost always want to restrict to people aged 25 or over, so you will only be looking at people who have completed their education.  Similarly, most of the income and occupation variables are only relevant for people of working ages, 18-55.  For details on using the selection filter at IPUMS, please see https://sda.usa.ipums.org/helpfiles/helpan.htm#filter
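To make the idea of a filter concrete, here is a small Python sketch of the two age restrictions just described. The person records are made up, and `filter_age` is a hypothetical helper of my own; on the website you would enter the equivalent restriction in the selection filter box instead.

```python
# Made-up person records for illustration only.
people = [
    {"AGE": 19, "EDUC": "Some college"},
    {"AGE": 25, "EDUC": "College"},
    {"AGE": 42, "EDUC": "High school"},
    {"AGE": 67, "EDUC": "College"},
]

def filter_age(rows, low, high=None):
    """Keep rows with low <= AGE <= high (no upper bound when high is None)."""
    return [
        r for r in rows
        if r["AGE"] >= low and (high is None or r["AGE"] <= high)
    ]

# Completed education: restrict to people aged 25 or over.
completed_education = filter_age(people, 25)

# Income and occupation: restrict to the working ages given above, 18-55.
working_ages = filter_age(people, 18, 55)

print([r["AGE"] for r in completed_education])
print([r["AGE"] for r in working_ages])
```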

When constructing tables that are tabulations, you will also want to use recode for any variable that is continuous (a quantity), not discrete (a category).  Examples include age, year of birth, and almost any of the income variables.  If you are working with age, instead of having a separate row or column for each single year of age (1,2,3, etc.) you will want to have a limited number of age groups: 1-9, 10-19 and so on.  Similarly, if you want to use total income (INCTOT), income from wages (INCWAGE), or other variables that record an amount in dollars, not a category, you will definitely need to recode the original values into categories.  If you attempt to carry out a tabulation in which one of the income variables is a row, column, or control variable, and don’t recode, the tabulation will almost certainly fail, with an error message indicating that there are too many rows or columns.  The definition of your income categories will depend on the year that you are looking at.  Because of inflation, typical incomes change dramatically over time.  See  https://sda.usa.ipums.org/helpfiles/helpan.htm#recode on how to carry out a recode.
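In case it helps to see what a recode does, here is a minimal Python sketch that collapses single years of age and dollar amounts into a handful of categories before counting. The values are made up, the function `recode` is my own illustration rather than the site's syntax, and the income cut points are arbitrary assumptions; you would pick ones that suit the year you are studying.

```python
from collections import Counter

def recode(value, bins):
    """Return the label of the first (low, high, label) bin containing value."""
    for low, high, label in bins:
        if low <= value <= high:
            return label
    return "Other"

# Age groups as in the text above.
AGE_GROUPS = [(1, 9, "1-9"), (10, 19, "10-19"), (20, 29, "20-29")]

# Hypothetical income categories; adjust the cut points to the year studied.
INCOME_GROUPS = [
    (0, 19999, "Under 20,000"),
    (20000, 49999, "20,000-49,999"),
    (50000, float("inf"), "50,000 or more"),
]

# Made-up values for illustration only (no missing-value codes included).
ages = [3, 7, 12, 15, 18, 24, 29]
incomes = [5000, 22000, 48000, 75000]

age_counts = Counter(recode(a, AGE_GROUPS) for a in ages)
income_counts = Counter(recode(i, INCOME_GROUPS) for i in incomes)

print(age_counts)     # three age groups instead of seven single-year rows
print(income_counts)  # three income categories instead of four distinct amounts
```

Without the recode, every distinct age or dollar amount would become its own row or column, which is why an unrecoded tabulation on an income variable tends to fail.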

You will also need to exclude missing or not available (N/A) values, especially if you are computing a mean.  In the IPUMS data, when information is missing for a variable in a particular observation, that is typically represented with a numeric value that will be included in any mean that you compute, unless you exclude it.  This is especially important for income variables.  In total income (INCTOT), missing data is represented by 9999999: https://usa.ipums.org/usa-action/variables/INCTOT/#codes_section.  For wage income (INCWAGE), missing is represented as 999999: https://usa.ipums.org/usa-action/variables/INCWAGE/#codes_section.  For the socioeconomic index (SEI), N/A is represented as 0: https://usa.ipums.org/usa-action/variables/SEI/#codes_section.  And so on.  If you fail to exclude the numeric codes for missing values from the calculation of a mean, you may get peculiarly high values (if N/A was being represented as 999999) or peculiarly low values (if N/A was being represented as 0).  If you are using other variables, you will need to check the documentation for them to see how missing or N/A was coded, and then exclude those values.
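A short Python sketch shows how much damage a single missing-value code can do to a mean. The income and SEI values are made up; the missing codes are the ones listed above, and `mean_excluding_missing` is a hypothetical helper of my own. On the website, you would achieve the same thing by restricting the valid range of the variable or by using a filter.

```python
# Missing/N/A codes as listed above; check the documentation for other variables.
MISSING_CODES = {
    "INCTOT": {9999999},
    "INCWAGE": {999999},
    "SEI": {0},
}

def mean_excluding_missing(values, variable):
    """Mean of values after dropping the variable's missing/N/A codes."""
    valid = [v for v in values if v not in MISSING_CODES[variable]]
    return sum(valid) / len(valid) if valid else None

# Made-up values for illustration: two real incomes and one missing code.
inctot = [20000, 40000, 9999999]
naive_mean = sum(inctot) / len(inctot)                    # includes the code
clean_mean = mean_excluding_missing(inctot, "INCTOT")     # excludes the code

# Made-up SEI values: the 0 is an N/A code, not a real score.
sei = [0, 40, 60]
naive_sei = sum(sei) / len(sei)
clean_sei = mean_excluding_missing(sei, "SEI")

print(naive_mean, clean_mean)
print(naive_sei, clean_sei)
```

In the income example the naive mean is in the millions even though both real incomes are under 50,000, while in the SEI example the N/A code pulls the mean down.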

Demographic and Socioeconomic Characteristics to Treat as Outcomes/Dependent Variables

Basic demographic and socioeconomic variables available in most of the decennial Censuses that you might want to consider as outcomes (dependent variables) include but are not limited to:

  • Current marital status (MARST)
  • Number of children born (CHBORN)
  • Age at first marriage (AGEMARR)
  • Total individual income (INCTOT)
  • Poverty status (POVERTY)
  • Educational attainment (EDUC)
  • Socioeconomic index (SEI) – this is a commonly used measure of the standing of an individual’s occupation.
  • Of course if you have found another variable that you are interested in, you are welcome to use that.  Some of you have mentioned school enrollment, home ownership, type of school, health insurance, and so forth.

The ACS also includes a rich set of demographic variables that could be used as outcomes.  The ACS are the annual data that show up from 2000 onward, for 2001, 2002, 2003, etc.  The most interesting ones for this class are some variables for very recent years that indicate whether certain events have occurred in the last year, and could be the basis for calculating rates, as opposed to percentages:

These lists are only meant as suggestions, and if you have other interests that can be addressed with other variables you have found, you may pursue them.

Demographic and Socioeconomic Characteristics to Treat as Explanatory/Independent Variables

Generally your explanatory variables should precede your outcome variables in time.  That doesn’t always  mean they have a causal effect on the outcome, but a causal interpretation is at least more plausible.  So, for example, you might examine number of children born (CHBORN) for women aged 45 according to their level of education (EDUC), but you probably won’t think about studying the education of women aged 45 according to their number of children.  The variables are of course the same in both cases, but the interpretation of which is an outcome and which is explanatory differs.

  • Race (RACE) – Note that since 2000, RACE includes codes identifying people who reported two or more races.  Since 2000 there are also indicators for single races, for example RACASIAN.
  • Hispanic (HISPAN) – Note that Hispanic status is separate from race.
  • A variety of other nativity and ancestry variables are available at http://usa.ipums.org/usa-action/variables/group/race_eth.  The availability of these variables tends to change over time, so there isn’t really one nativity or ancestry variable that is available on a continuous basis since 1850.  I will post a separate guide to using some of the key variables.
  • Geographic identifiers in http://usa.ipums.org/usa-action/variables/CITY#codes_section.  Note that the IPUMS doesn’t offer any more detail than City, so with IPUMS you can’t compare different neighborhoods in the same city.
  • Of course you can use EDUC, INCTOT, and other variables as explanatory variables.  Just make sure that your dependent variable comes after them in time.

Examples of tables you could construct

  • Use the comparison of means to look at the mean number of children born for people of different races in different years.  In this case, you would select number of children as your dependent variable, and RACE and YEAR as row and column variables.  You would probably want to set a filter to restrict to (for example) women who were old enough to have completed their childbearing, say 50 years old.  You might want to restrict to decennial census years.
  • Use the comparison of means to look at mean income for people of different ages with different levels of education.  In this case you would select income as your dependent variable, and age and education as your rows and columns.  You would probably want to set a filter to restrict to ages when people might actually have incomes, for example, 25-55.  You would want to recode age so that instead of a separate row for each single year of age, you have three rows, one for each ten-year age group.
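The logic of the second example can be sketched outside of SDA. The following is only a minimal Python illustration, not how SDA itself works: the records are made-up (age, education label, income) tuples, the function names are hypothetical, and the age filter is assumed to be 25-54 so that it divides into exactly three ten-year groups.

```python
from collections import defaultdict

# Hypothetical toy records: (age, education label, income). Not real IPUMS data.
records = [
    (27, "high school", 30000),
    (29, "high school", 34000),
    (33, "college", 52000),
    (38, "high school", 40000),
    (41, "college", 70000),
    (47, "college", 80000),
    (52, "high school", 44000),
    (19, "high school", 8000),   # dropped by the age filter
    (63, "college", 90000),      # dropped by the age filter
]


def age_group(age):
    """Recode a single year of age (25 or older) into a ten-year group label."""
    lower = 25 + 10 * ((age - 25) // 10)
    return f"{lower}-{lower + 9}"


def mean_income_table(rows, min_age=25, max_age=54):
    """Return {(age group, education): mean income} for rows inside the age filter."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for age, educ, income in rows:
        if not (min_age <= age <= max_age):
            continue  # plays the role of the selection filter
        key = (age_group(age), educ)
        sums[key] += income
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


table = mean_income_table(records)
for (group, educ), mean in sorted(table.items()):
    print(f"{group:>6}  {educ:<12} {mean:10.1f}")
```

Each printed line corresponds to one cell of the table: an age group (row), an education level (column), and the mean income in that cell.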


  • My posts with IPUMS tips and tricks are accessible via http://camerondcampbell.me/category/ipums/.  Make sure to review them to see if there is anything that helps you.
  • If you are trying to use an income variable such as INCTOT as a row or column variable, you will need to recode it into a limited number of categories in order for a table to work.  If you simply specify INCTOT or another income variable as a row or column variable, the table won’t run, because there are too many distinct values, requiring thousands of columns or rows.  You will need to use the recode to regroup incomes into a manageable number of categories, and of course exclude 9999999 and 9999998.
  • Most if not all of the income variables, including INCTOT, FINCTOT, and HINCTOT, code missing values or not available as 9999999,  9999998, 999999, 999998, or some variant thereof.  INCTOT codes missing values as 9999999: https://usa.ipums.org/usa-action/variables/INCTOT/#codes_section.  If you are carrying out a comparison of means, you need to exclude those observations because the average shouldn’t include these values.  You could do this by putting inctot(*-9999997) in the filter.
  • Similarly, if you are categorizing income, make sure that the highest category of income doesn’t include 9999998 and 9999999.  For example, inctot(r:0-9999;10000-19999;20000-29999;30000-39999;40000-49999;50000-9999997)
  • Many of the fertility variables use 0 to indicate missing or no response, and 1 to indicate no births or no children.  For example, the ACS variable FERTYR is 0 for not available, 1 for no births in the last year, and 2 for one or more births in the last year: https://usa.ipums.org/usa-action/variables/FERTYR#codes_tab .  Similarly, CHBORN is 0 for not available, 1 for no children, 2 for one child, 3 for two children, and so forth: https://usa.ipums.org/usa-action/variables/CHBORN#codes_tab .  Be attentive to this when you interpret your results.  If you are computing the mean number of children, or the mean number of births, you will need to exclude the 0 codes and will often want to subtract one from the numbers you present.
  • If you are computing averages of any variables via Comparison of Means, make sure to inspect the detailed documentation for those variables to find out how missing values are coded, and use a selection filter to exclude them.
  • Again, use selection filters to make sure that the observations you include are relevant to the question you are interested in.  For example, if you want to use SCHOOL to look at whether or not someone is currently enrolled in school, you would want to restrict to people who have a chance of being currently enrolled by applying a selection filter based on age.  Restricting to age(14-18), for example, would let you look at people who were eligible to be in high school.  If you are looking at completed education, normally you would want to restrict to ages 25 and above.
  • Remember that not every variable is available in every year.  For the variables you are interested in, check to see which years they are available in.  Some very interesting variables are only available in one or two years.  The variables related to ethnicity, nativity, and origin are especially prone to change.
  • Remember that 2001-2009 are based on the ACS.  If you just want to present data from the decennial Census, you would restrict to years 1850-2000, and if you just wanted ACS data, you would restrict to 2001-2009.
  • Keep in mind that the ACS has some nice variables that allow for direct computation of certain demographic rates, like whether or not someone has married in the last year, whether or not someone has had a birth in the last year, and so forth.
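To see why the missing-value codes matter, here is a small Python sketch of the two adjustments described in the tips above. It is an illustration under stated assumptions rather than anything SDA runs: the values are made up, the function names are hypothetical, and the cutoff 9999997 simply mirrors the filter inctot(*-9999997).

```python
# Highest INCTOT value treated as valid, mirroring the filter inctot(*-9999997).
INCTOT_MAX_VALID = 9999997


def mean_income(values):
    """Mean of INCTOT-style values after dropping the missing-value codes."""
    valid = [v for v in values if v <= INCTOT_MAX_VALID]
    return sum(valid) / len(valid) if valid else None


def mean_children(chborn_codes):
    """Mean number of children from CHBORN-style codes.

    Code 0 (not available) is dropped. Codes 1, 2, 3, ... stand for
    0, 1, 2, ... children, so one is subtracted from each remaining code.
    """
    valid = [c - 1 for c in chborn_codes if c != 0]
    return sum(valid) / len(valid) if valid else None


# Hypothetical toy values, not real IPUMS data.
incomes = [20000, 40000, 9999999, 60000]
print("filtered mean income:", mean_income(incomes))
print("naive mean income:   ", sum(incomes) / len(incomes))  # inflated by the 9999999 code

chborn = [0, 1, 3, 4, 0, 2]
print("mean children:", mean_children(chborn))
```

The naive mean is dominated by the single 9999999 entry, which is why the filter has to be applied before any averaging.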

SOSC 1860 F13 Assignment 2 Introduction to UN Data

Due via TurnItIn on Friday, October 4 at midnight.

This assignment will introduce you to a very useful web resource for international demographic data, UN Data, and will hopefully prepare you for our discussions of mortality and fertility decline around the world. You will examine trends in demographic rates in three countries by examining data on trends in infant mortality, life expectancy and total fertility rates that you gather from the site. You will not need to do any calculations for this, just look up numbers.

Pick three countries: one developed country in Europe or North America, one developing country somewhere else in the world, and one country in East or Southeast Asia. We are going to use UN Data (http://data.un.org) to examine trends in infant mortality, life expectancy, and fertility from the fifties to the present for the countries you select.

You can gather data on infant mortality, life expectancy, the total fertility rate, and other social, demographic, and economic indicators at UN Data by typing in the name of your country and the indicator you are interested in, almost like doing a search on Google. For example, to find data on infant mortality in Mauritius, just type ‘infant mortality mauritius’ in the search box. A page will come up with search results from different UN databases and publications like “Key Global Indicators,” “Millennium Development Goals Database,” “World Health Organization,” and so forth.

Note that for the searches below, you may need to check several results before you find one with a relatively complete series. Searching for infant mortality, for example, may turn up several different sets of numbers from different sources. You will want to pick the one with the longest reach.

The initial searches may yield projections for the future. Ignore those numbers for the time being, and only present data for years that have already elapsed. Once you have done your initial search and brought up lists of results, you can use the year filter on the left to restrict to years that have already passed.

Part 1

Examine trends in infant mortality since the 1950s in your countries. Present the basic information you recover from the website as a simple table in which you have one column for each of the countries you chose, and one row for each year. You don’t need data for every single year. Every five or ten years is fine, perhaps 1950, 1955, 1960, etc. Depending on the country, data may not be available for some of the early years. In which country did infant mortality fall the most? Based on your examination of the data, in what era did infant mortality fall the fastest in developing countries? How did infant mortality change in developed countries? What has been happening recently?

Part 2

Do the same for life expectancy at birth, separately for males and females. In what period did life expectancy increase the fastest? In which country did life expectancy increase the most? What has been happening recently? Which of your countries has the widest gap between males and females?

Part 3

Look at trends in the total fertility rate (TFR) for your countries. Note that you should search for ‘total fertility’ rather than ‘total fertility rate’ since that is what the series are titled. Have rates declined over time? If so, when did they decline the fastest? What has happened in the last few decades?

SOSC 1860 F13 Assignment 3 Introduction to IPUMS

Due via TurnItIn on Monday 10/14 at midnight.

This assignment introduces you to the Integrated Public Use Microdata Series (IPUMS), the site at the University of Minnesota that many of you will use for your research paper due at the end of the semester. For this assignment, you will visit the site, collect some basic data for a state of your choice by using the online data analysis facility (http://usa.ipums.org/usa/sda/), and interpret it. Please read the instructions carefully and follow them step by step. If you follow the instructions carefully, you should be able to complete the assignment very quickly.

I am having you work with IPUMS not because it is the United States, but rather because right now, it is the largest, most detailed, and easiest to use website for analyzing Census data. Right now there is no other online data source that covers such a large population over such a long period of time (1850 to the present) with so many variables. Accordingly, it is ideal for introducing basic analysis of demographic data.

Before you start the assignment, please read the brief instructions for using the online data analysis facility at http://usa.ipums.org/usa/resources/sda/sdainstructions.pdf. Since we will be making heavy use of restrictions and selection filters, to ensure that only cases that meet specified criteria are included in the analysis, please read the description of restrictions at http://sda.usa.ipums.org/helpfiles/helpan.htm#range and the description of selection filters at http://sda.usa.ipums.org/helpfiles/helpan.htm#filter. Restrictions are applied when variables are specified in ROW, COLUMN, or CONTROL, whereas selection filters are specified in Selection Filter. Since we will also be recoding/transforming variables to simplify the output, for example, by grouping observations by age, please also read the description of transformation at: http://sda.usa.ipums.org/helpfiles/helpan.htm#recode

Note that this assignment will ask you to construct some nicely formatted, presentation-quality tables and include them in the assignment you upload. You will almost certainly find it easiest to prepare tables first in Excel and then copy the resulting tables into Word. I have a blog entry explaining how to get results from IPUMS into Excel and then into Word that will make it easy for you to create fabulous looking tables:


I have posted a variety of other videos and tips for working with IPUMS (http://camerondcampbell.me/category/data/ipums/). You may want to get started with

Part 1

Since we will talk about population aging, we will start by looking at changes in the age composition of the country as a whole over the long term. Specifically, let’s look at the age distribution of the population in 1850, 1900, 1950, and 2000, focusing on the percentages of the population who were children (0-17 years), working age (18-64), and older (65+).

Since we will be making use of data from multiple years, at http://usa.ipums.org/usa/sda/ click on ‘United States, 1850-2009’ under ‘Use data from multiple years’: http://sda.usa.ipums.org/cgi-bin/sdaweb/hsda?harcsda+1850-2009.

This brings up a screen where you can specify the parameters of your analysis, which by default is a table with cells that represent counts of observations with different combinations of characteristics. Other, fancier options are available, but for the time being we will stick with tabulations that produce results you can put into a table.

We would like to generate a table with four columns and three rows, in which the columns correspond to the years 1850, 1900, 1950, and 2000, the rows correspond to people aged 0-17, 18-64, and 65+, and the cells present the numbers of people of that age in that year, as well as that number as a ‘column percentage’, that is, as a percentage of all the observations in that year.

Since we only want data from 1850, 1900, 1950, and 2000 for our columns, for Column enter year(1850,1900,1950,2000). Since we want to group ages into three categories, 0-17, 18-64, and 65+, specify the Row as age(r:0-17;18-64;65-*). The r: indicates that the values are to be recoded into the specified groups.

Please transcribe the column percentages and the column totals into a nicely formatted table based on the following template:

Age Distributions of the Population of the United States in 1850, 1900, 1950, and 2000

Age          1850     1900     1950     2000
0-17
18-64
65+
Total (%)
Total (N)

Note that the entries in each column should sum to 100, since together they should account for the total population in that year. Column percentage is turned on by default on the screen so unless you change something by checking boxes for other percentages or unchecking column percentage, the percentages you see should be column percentages.
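For anyone unsure what a column percentage is, the calculation can be sketched in a few lines of Python. This is only an illustration with made-up (year, age) pairs and hypothetical function names. It mimics the recode age(r:0-17;18-64;65-*) and then expresses each age group as a percentage of all observations in the same year.

```python
from collections import Counter

# Age groups matching the recode age(r:0-17;18-64;65-*); None means no upper bound.
AGE_GROUPS = [("0-17", 0, 17), ("18-64", 18, 64), ("65+", 65, None)]


def recode_age(age):
    """Return the age-group label for a single year of age."""
    for label, low, high in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"age out of range: {age}")


def column_percentages(people):
    """people: iterable of (year, age). Returns {year: {age group: percent}}."""
    counts = Counter((year, recode_age(age)) for year, age in people)
    totals = Counter()
    for (year, _label), n in counts.items():
        totals[year] += n
    return {
        year: {label: 100.0 * counts[(year, label)] / totals[year]
               for label, _low, _high in AGE_GROUPS}
        for year in sorted(totals)
    }


# Hypothetical toy data: (census year, age). Not real IPUMS data.
people = [(1850, 5), (1850, 12), (1850, 30), (1850, 70),
          (2000, 10), (2000, 40), (2000, 50), (2000, 66), (2000, 80)]

pct = column_percentages(people)
for year, column in pct.items():
    print(year, column, "sum =", sum(column.values()))
```

Because each percentage is taken within a year, the entries for any one year add up to 100, which is the check described above.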

Please make sure you understand what is going on with the percentaging to make sure the numbers are being calculated in a way that makes sense. One recurring problem over the years has been that many students percentage their tables incorrectly, producing nonsensical results.

Please do prepare the table as described above. Don’t copy and paste the output directly, or print out the output and turn it in.

Write 2-3 sentences describing the trend that you see.

Part 2

Please redo Part 1, but limit the analysis to a state of your choice. Preferably it would be a state that was in the Union by 1850, or at least 1900, so that you can look at changes over time. You can restrict the calculation to the state of your choice by using the variable statefip, which is the FIPS code for each state. Its values are provided here:


To restrict to California, you would enter statefip(6) into the field for Selection Filter.

Make sure to prepare a nice table like the one in part 1 and include it in your submission.

Write two to three sentences comparing the state that you have chosen to the country as a whole as represented in the results for Part 1.

Here is what the output for California looked like:

Part 3

We will now look at changes in the marital status of the population over time, since we will be discussing family change later in the semester.

The variable describing marital status is marst. Its values are described here: http://usa.ipums.org/usa-action/variables/MARST#codes_section

We will look at changes over time in the percentage of the adult population in different marital statuses.

This time, we want rows to correspond to years. We would like a little more detail on trends in the twentieth century, and are less interested in the period before 1900, so enter year(1900,1930, 1940,1950,1960,1970,1980,1990,2000) as your row variable.

Restrict your analysis to working-age adults by entering age(18-59) as your selection filter. Including the elderly affects observed trends because of the increases in the proportion of the population who are likely to be widowed. Note that this definition of working age is different from the one used in Parts 1 and 2.

The column should correspond to different marital statuses. To make life easier, let’s combine marital statuses, putting all the married into one category, all the separated and divorced into another:


Make sure that row percentage is checked, and column percentage is unchecked.

From the output, prepare a nicely formatted table in which the rows are years, the columns are the different marital statuses, and the entry in each cell represents the % of people aged 18-59 in that year who have that marital status.
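Row percentages work the same way as column percentages, except that each cell is divided by the total for its row (here, its year). The Python sketch below illustrates this with made-up records and hypothetical function names. It does not reproduce the SDA recode expression for this part. The grouping of MARST codes is an assumption based on my reading of the codes page linked above (1-2 married, 3 separated, 4 divorced, 5 widowed, 6 never married), so please verify it there, and keeping widowed and never married as separate columns is also just one possible choice.

```python
from collections import Counter

# ASSUMED grouping of MARST codes; check the IPUMS codes page before relying on it.
MARST_GROUPS = {
    1: "married", 2: "married",
    3: "separated/divorced", 4: "separated/divorced",
    5: "widowed",
    6: "never married",
}
COLUMNS = ["married", "separated/divorced", "widowed", "never married"]


def row_percentages(people, min_age=18, max_age=59):
    """people: iterable of (year, age, marst). Returns {year: {status: percent}}."""
    counts = Counter()
    totals = Counter()
    for year, age, marst in people:
        if not (min_age <= age <= max_age):
            continue  # plays the role of the selection filter age(18-59)
        counts[(year, MARST_GROUPS[marst])] += 1
        totals[year] += 1
    return {
        year: {col: 100.0 * counts[(year, col)] / totals[year] for col in COLUMNS}
        for year in sorted(totals)
    }


# Hypothetical toy data: (year, age, MARST code). Not real IPUMS data.
people = [
    (1900, 30, 1), (1900, 40, 2), (1900, 25, 6), (1900, 50, 5),
    (1900, 70, 5),  # excluded: older than 59
    (2000, 30, 1), (2000, 35, 4), (2000, 45, 3), (2000, 22, 6), (2000, 28, 6),
]

rows = row_percentages(people)
for year, row in rows.items():
    print(year, row)
```

With row percentages, it is the entries across each year that should add up to 100, not the entries down each marital status.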

Write a few sentences commenting on the trends that you observe. Does anything in particular catch your attention?

Part 4

Now we will look at fertility as a function of age and education, using the variable fertyr (https://usa.ipums.org/usa-action/variables/FERTYR#description_section). Fertyr is included in the recent IPUMS data which is based on the annual American Community Survey. Women aged 15-50 were asked if they had a child in the last year. It is 0 if the variable is missing or not valid (for male respondents, people aged less than 15 or more than 50), 1 if a woman said she didn’t have a child in the last year, and 2 if she had a child in the last year.

Since we want to use ACS data, go to http://usa.ipums.org/usa/sda/ and click on ‘ACS 2001-2011’ under ‘Use data from multiple samples’: http://sda.usa.ipums.org/cgi-bin/sdaweb/hsda?harcsda+all_acs_samples

We are going to approach this calculation a bit differently, and introduce another capability of IPUMS. We want to compute the proportion of women who had a birth. We can do this by recoding fertyr so that 1 becomes 0, and 2 becomes 1. The average of the resulting 0’s and 1’s will be the proportion of women who had a birth.

When you reach the screen where you set up your analysis, mouse over ‘Analysis’ in the upper left, and then click on ‘Comparison of Means’ when it appears.

Once you reach the Comparison of Means screen, enter fertyr(r:0=1;1=2) as the Dependent Variable. Enter age(r:15-19;20-24;25-29;30-34;35-39;40-44;45-49) as the row variable and educ(r:0-5;6;7-9;10;11) as the column variable. In the education recode, 0-5 groups people with less than a high school education, 6 is people with a high school education, 7-9 is people with some college, 10 is college graduates, and 11 is people with some graduate school. Run the calculation, and use the results to produce a nice table that for each combination of education and age group identifies the proportion of women who have had a birth in the last year.
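The reason the mean of the recoded variable is a proportion can be seen in a short Python sketch. It is an illustration only, using made-up records and hypothetical function names. It follows the description above: FERTYR code 1 becomes 0, code 2 becomes 1, and code 0 (not available) is left out, so the average of the 0s and 1s in each cell is the share of women who had a birth in the last year.

```python
from collections import defaultdict


def recode_fertyr(code):
    """Map FERTYR codes to a 0/1 birth indicator; code 0 (not available) gives None."""
    return {1: 0, 2: 1}.get(code)


def age_group(age):
    """Five-year age group label for ages 15-49, or None outside that range."""
    if 15 <= age <= 49:
        low = 15 + 5 * ((age - 15) // 5)
        return f"{low}-{low + 4}"
    return None


def birth_proportions(women):
    """women: iterable of (age, education label, fertyr code).

    Returns {(age group, education): proportion with a birth in the last year}.
    """
    births = defaultdict(int)
    counts = defaultdict(int)
    for age, educ, fertyr in women:
        indicator = recode_fertyr(fertyr)
        group = age_group(age)
        if indicator is None or group is None:
            continue  # skip not-available codes and ages outside 15-49
        births[(group, educ)] += indicator
        counts[(group, educ)] += 1
    return {key: births[key] / counts[key] for key in counts}


# Hypothetical toy data: (age, education label, FERTYR code). Not real ACS data.
women = [
    (21, "high school", 1), (22, "high school", 2),
    (23, "high school", 1), (24, "high school", 1),
    (31, "college", 2), (33, "college", 1),
    (40, "college", 0),   # not available: excluded
    (50, "college", 1),   # outside the 15-49 age groups: excluded
]

props = birth_proportions(women)
for (group, educ), p in sorted(props.items()):
    print(group, educ, p)
```

Read together with the description, the expression fertyr(r:0=1;1=2) puts the new value on the left of each equals sign and the old value on the right.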

Write a few sentences about the patterns you observe.

Part 5

Please write about 250 words with some ideas about the project you would like to do for the class, hopefully using IPUMS data, but possibly using data from other Census sources. Explore the list of IPUMS variables at http://usa.ipums.org/usa-action/variables/group and find out which relevant variables are available. Make sure to name some of the variables you are interested in using. If you are especially interested in detailed economic variables, you might also want to explore the Current Population Survey: http://cps.ipums.org/cps-action/variables/group. The GSS is another possibility. If you are ambitious, you can also do something using IPUMS International. You may also propose to do a comparison between HK and the United States or some other country available at IPUMS International.

Your response must make clear that you have spent some time at the IPUMS website exploring the variables. In other words, it isn’t sufficient to simply say “I want to study marriage and education.” You would need to provide additional details that show me you spent time at the website, like names of the variables, the populations you might restrict to, and so forth.

Of course, you are welcome to talk to me about your ideas.