Chinese language announcement of our 2012 CMGPD summer course at SJTU

We’ve pretty much finalized the text of our announcement for our summer short course at SJTU this July.  I’ll produce an English language version pretty soon and post it.  This summer I will probably lecture in English so we encourage applications from non-Chinese speakers.               


上海交通大学中国东北历史与社会研究中心是上海交通大学人文学院与香港科技大学人文与社会科学学院合作建立的研究机构。中心第二期暑期学校于201276日-720日开课。课程以社会人口学与清代辽宁多代人口数据库的利用为核心,特聘香港科技大学人文社会科学院院长李中清(James Lee教授为课程顾问,加州大学(洛杉矶)社会学系康文林(Cameron Campbell教授主讲,全英文授课。

 清代辽宁多代人口数据库(China Multi-Generational Panel Dataset – Liaoning),是研究中国家庭与社会人口史的重要数据库,亦可以为人口行为、亲属与社会分层的过程研究,提供庞大的数据支持。数据库适合任何社会科学统计软件进行基本的统计分析。


中国清代双城多代人口数据库(China Multi-Generational Panel Dataset – Shuangcheng)是李中清与康文林工作组建立的中国清代及民国初年人口数据库,包含10万余人的近135万条记录。本数据库的原始资料为清代吉林将军双城堡旗人户口册,详细追踪记录了京旗、屯丁和浮丁三类旗人的人口和家户土地持有数量信息,是连续的人口与社会经济动态信息,为历史学、人口学、经济学等各项研究提供了珍贵的资料。为促进学术交流,数据库准备近期向世界开放。
正式开课时间 201276日—720
暑期学校地点:  上海交通大学人文学院
报名结束时间:  2012425日(请附个人简介与申请表)
学习条件:      提供免费学生宿舍住宿、学习资料;请自备电脑(推
报名方式:      请将您的个人简介与申请表,发送到
报名资格      熟悉STATA等统计软件,掌握回归分析等方法者等优先考虑。
学员名额      25


Announcement of 2012 CMGPD-LN Summer Course at SJTU

We’ve begun making our detailed plans for the 2012 CMGPD-LN Summer Course at Shanghai Jiaotong University.  The Chinese-language announcement is available at our SJTU Center website, via this link:  It will be July 6 to July 20.  Since there may be non-Chinese speaking participants this year, I will probably lecture in English.  The goal of the course is to introduce participants to management and analysis of the CMGPD-LN data, with special attention to using STATA to transform the data and create new variables as needed for different analyses.

Fertility rates using the births in last year variable from the ACS in IPUMS

[This is another note on using the SDA interface to analyze IPUMS that is intended for students in my Introduction to Social Demography.  I am posting it here rather than my class website because it may be of interest to others who are using the IPUMS for teaching.]

The ACS includes some very useful questions on demographic events within the last year, including births, marriages, and divorces within the last year.  Many students have indicated an interest in studying birth rates, so I am writing this note to provide some help on using the ACS data from 2001 to the present to calculate basic rates.

The ACS data on IPUMS includes a variable that indicates whether someone has had a birth in the last year.  It is 0 for cases where the information is not available, 1 if no birth occurred, and 2 if a birth occurred.  We can use this to approximate fertility rates if we restrict (using the filter) to observations where fertyr was 1 or 2, use comparison of means, and remember to subtract 1 from the means that appear in the table.  We need to subtract because 1 indicates no births, while 2 indicates a birth.

To calculate age-specific rates by year from 2001 to 2009, I set up a calculation with the following…

Dependent variable: fertyr
Row: age(r:10-14;15-19;20-24;25-29;30-34;35-39;40-44;45-49)
Column: year
Selection filter: fertyr(1-2) sex(2) age(10-49)

Also, under ‘Change number of decimal places to display’, I selected 3, so that whatever the mean was, subtracting 1 and multiplying by 1000 would yield a rate per thousand.

Here is the output.

To address the problem associated with fertyr being 1 for people who haven’t had a birth, and 2 for people who have had a birth, we could recode fertyr so that 0 means no births, and 1 means a birth.  In that case, the mean would actually be the proportion of people who have had a birth:

Dependent variable: feryr(0=1;1=2).

Of course, instead of using year as the column variable, one could use race, or some other variable of interest.

Just remember that what is reported in each cell of the output is the mean number of children in the last year plus one, so that when you prepare tables to turn in, you subtract one from each of the values in the cells in the output.

Comparing birth cohorts instead of time periods in the IPUMS

[Another note intended for students in my Introduction to Social Demography class who are using IPUMS-USA for their final projects, but which may be of interest to others using IPUMS in their courses]

Many students have expressed interest in examining time trends in average age at marriage, total number of children, completed education, and other phenomena that are fixed relatively early in life.  Looking at these numbers by Census year (i.e. by making year a row or column variable) is plausible, but doing so mixes together people who came of age in various eras, unless there is some carefully restriction on the ages of the people.

For example, looking at total number of children born for women in a single Census mixes together relatively young people who went through their childbearing recently, when birth rates were low, and people who went through childbearing earlier, when birth rates were high.  This makes comparison across Census years problematic.  Similar problems exist for average age at marriage, and so forth.  One approach is to use a filter to limit the women included in a comparison to a narrow age range which is easy to compare across census.

Another approach, however, is to use a recoded variable for year of birth as a row or column variable, and thereby compare men or women according to the era in which they were born.  This is fairly straightforward.

As an example, to look at trends in average age at marriage in successive birth cohort, I set up a comparison of means calculation.  The dependent variable to agemarr, the row variable to birthyr(r:1890-1899;1900-1909;1910-1919;1920-1929;1930-1939;1940-1949), the column variable to sex, and the filter to agemarr(1-99) age(40-50) birthyr(1890-1940).  I restricted to age 40-50 so that the calculation would only include people who had an opportunity to marry. Birthyr is limited to 1940 because agemarr is available only through 1980.  The result was the following:

If I wanted to do this by race, I could have set the column variable to race, and the control variable to sex.

If I wanted graph of the average ages for people born in individual years, I can specify the row variable as birthyr but with no recode, and then down below check ‘Suppress table’ (since it will have 50 rows, one for each year) and then under ‘Type of chart’ choose line chart.  The result is the following:

Of course, you could just as easily do this by race, or education, or something else.

As another example, I redid the calculation to look at mean number of children ever born (chborn).  I set the dependent variable to chborn, the row to birthyr(r:1850-1859;1860-1869;1870-1879;1880-1889;1890-1899;1900-1909;1910-1919;1920-1929;1930-1939;1940-1949), the filter to age(50-80) birthyr(1850-1940) chborn(1-*).  The restriction of chborn to 1 and higher reflects the fact that chborn is 0 for people for whom the information is not available, and 1+the number of the children for everyone else.  The filter for age being 50-80 restricts to women who are at least age 50, and have therefore completed their childbearing.  Thus chborn being 1 means 0 children etc.  In interpreting the results below, remember that the mean of chborn is one higher than the actual mean number of children.  To get the mean number of children, you need to subtract one.

You may want to at least consider this approach for any outcome that is fixed relatively early in life, and may vary a lot according to the era in which someone grew up.  Educational attainment would be another logical choice.

Using ethnicity/nativity variables in IPUMS to identify 1st/2nd/3rd+ generation

[These are some notes intended for students in my undergraduate Introduction to Social Demography class, for use in working on their final projects, but I thought they would be of wider interest to others using IPUMS in their teaching.]
Many students are interested in doing detailed comparisons of the social and demographic characteristics of specific ethnic groups.  In reviewing the project proposals, I saw that many students had used, or were planning to use, the detailed codes for the RACE variable.  I would strongly encourage everyone who is interested in a specific ethnic group to assess whether some of the other available variables like the self-reported ethnicity variables (ANCESTR1 and ANCESTR2) available in 1980, 1990, and 2000 might offer more cases and better resolution.

If you want to distinguish between 1st generation and later generation, you can filter on BPL, as described below.  Until 1970, a variable for father’s birthplace, FBPL, is also available, and as described below, can be used to identify the second generation.

·         First-generation immigrants (Born abroad)
o   Filter based on the birthplace variable, BPL.  Codes for BPL identify the state or country of birth of the respondent.  If you want to restrict to people born in a particular country, look up the code for that country, and use that code in the filter.  For, example, people born in Sweden would be identified by BPL being equal to 405, so in the filter field on the screen for specifying your tabulation you would enter bpl(40500) along with whatever filters are relevant to your calculation.
o   BPL codes:  Make sure to choose ‘Detailed’ rather than ‘General’ so that you see the 4 or 5 digit codes that you will use in your filters.  If you use the 3 digit codes listed under the ‘General’ view, the filter will not work properly.
o   With the detailed codes that you will use in your filter, there may be multiple codes corresponding to the same country, because there are different codes for regions in the same country, especially if during the nineteenth century, these regions were separate countries.  For example, Canada is 15000-15083, Germany is 45300-45362.
·         The second generation (born in the U.S. to a parent who was born abroad)
o   You can identify people with a parent born abroad by use of the Father’s Birthplace (FBPL) or Mother’s Birthplace variable (MBPL).  To make things consistent, please base your definition of the ‘second generation’ on the father’s birthplace (FBPL). 
o   To ensure that you are considering individuals born in the United States to father or mother who was born abroad, combine a filter based on father’s birthplace being in the country of interest (FBPL) with a filter based on own birthplace (BPL) being the United States.
o   For example, to limit your tabulation to records of second-generation Swedish-Americans, that is people born here to Swedish fathers, you would include bpl(100-12092) fbpl(40500) in your filter.
o   With the detailed codes that you will use in your filter, there may be multiple codes corresponding to the same country, because there are different codes for regions in the same country, especially if during the nineteenth century, these regions were separate countries.  For example, Canada is 15000-15083, Germany is 45300-45362.
o   In the dataset that is available for online analysis, fbpl is only available through 1970.  It isn’t provided in the 1980, 1990, and 2000 data that are available online.   So your tabulations involving fbpl will normally end in 1970.
·         Second  and later generation, 1980-2000 (born in the U.S., but claiming an ethnicity)
o    In 1980, 1990, and 2000, the Census form included a question about ethnicity.  The response is in the variables ANCESTR1 (for the first response) and ANCESTR2 (for the second response).  The responses, as we know from the article by Hout and Goldstein on the Irish-American population, are highly subjective.  Nevertheless it does allow you to get a picture of an ethnicity that includes more than just the first- or second-generation.
o   Identify the second and later generations as people who specified your ethnicity of interest in ANCESTR1, but who also indicated that they were born in the U.S.
o   ANCESTR1 codes (note that values differ from the BPL and FBPL codes):  Make sure to choose the ‘Detailed’ view rather than the ‘General’ view so that you can see the three digit codes you will need for your filter.
o   For example, to limit your tabulations to records of second- and later-generation Swedish-Americans 1980-2000, you would add bpl(100-12092) ancestr1(890-900) to your filter.
o   Do not include fbpl in your filter here because in the dataset that is available for online analysis, fbpl is not included in 1980, 1990, 2000.  If you filter on on ancestr1 and fbpl you will end up with no cases.
o   Note that because responses on the ancestry question were open-ended, and people sometimes responded with a region in a particular country rather than the country itself, to ensure you get everyone associated with a particular country, you may need to specify a range of codes.  For example, Italian includes all the codes from 510 (Italian) to 730 (Venetian), so to pick up all the people who might plausibly be claimed to be Italian, you would specify ancestr1(510-730) in your filter.
·         Native-born population of the U.S.
o   Include bpl(100-12092) in your filter.

Using ‘comparison of means’ to calculate proportions at IPUMS-USA

(I wrote this for the students in my undergraduate lecture course Introduction to Social Demography. They are working with IPUMS-USA for a final project.  I thought it might be of more general interest to others who are using IPUMS-USA for each.)

We often want to calculate the proportion of people with some characteristic according to the values of two other variables.  The characteristic of interest might be represented by a single value of a categorical variable, or one or more values of a categorical variable, or even a range of values in a continuous variable.  We can do this with the ‘comparison of means’ tab that we use to compute the mean of income, socioeconomic index, or other continuous variables.  We just have to recode the categorical variable that we are interested in into a dichotomous variable that is 1 if the person has the characteristic we are interested in, and 0 otherwise.

For example, we might want to calculate the proportion of people who have ever been married, according to year and age group.  By ‘ever been married’, we mean anyone who is currently married, or was married in the past, but is now widowed, separated, or divorced.  In the MARST variable for marital status, that would be anyone who had values 1-5.  The remaining value, 6, corresponds to people who have never been married.

Of course, we could do a cross-tabulation in which our column variable was marital status, our row variable was age, and our control variable was year.  We could add up the percentages of people in statuses 1-5 in the various tables.  Of course, we could recode 1-5 into one category and have the computer do the addition for us, but we would still end up with a lot of output to go through.

Alternatively, we could recode marital status into a dichotomous variable that takes on the value of 0 or 1 according to whether someone has ever been married, and then compute the mean of that new variable for different combinations of year and age group.  In the following example, I have set up a ‘comparison of means’ calculation in which the dependent variable is MARST recoded so that all values corresponding to categories where a person is currently married or was married in the past (MARST 1 through 5) are 1, and the never married are 0.  The mean of this variable will be the proportion of people who are married, or were married in the past but are now widowed, separated, or divorced.

In the following, pay particular attention to the use of recode in the specification of the dependent variable to turn marst into a dichotomous variable:

 1 proportion_ever_married_by_age_and_year


Below is an example setting up a calculation to calculate proportions enrolled in school.  School enrollment is originally coded so that 1 indicates that someone is not enrolled, and 2 indicates that they are enrolled.  We recode to change 1 to a 0, and 2 to a 1, so that the mean ends up being the proportion currently enrolled.  Note that for the school enrollment variable, it only makes sense to consider people who are at the right age to be enrolled in school.

2 enrollment_example

Of course, you could do this with any number of other variables, including variables that were originally numeric or continuous.  In the example below, I have transformed POVERTY so that it is 0 or 1 according to whether the household in which an individual lives is at or below the poverty line.  POVERTY is originally coded as a three digit number that represents the household’s income as a percentage of the poverty line.  100 means that a household is at the poverty line, 001-099 means that a household is below the poverty line, and 101 up to 500 means that a household is above the poverty line.  There are no values above 500 because POVERTY is top-coded: if a household is earning more than 500% of the poverty line, it is just set to 500.  In the specification of the dependent variable, I have used the recode facility to change all values of poverty that are 101 or higher to 0, and all values of 001 to 100 to 1.  The mean of the variable is therefore the proportion of people living in poverty.  Note that the recode excludes 0 because 0 indicates that the value is not available.

3 poverty_recode_example

The value in each cell represents the proportion of individuals of the specified race in each year who are in poverty.