Blog Posts

Summer 2012 China Multigenerational Panel Dataset class at SJTU (English announcement)

The Shanghai Jiaotong University Center for the History and Society of Northeast China was established as a research unit by a collaboration of the Shanghai Jiaotong University (SJTU) School of the Humanities and the Hong Kong University of Science and Technology (HKUST) School of the Humanities and Social Sciences. The Center’s second summer school will be held from July 6 to July 20. The class will focus on the use of the China Multigenerational Panel Dataset – Liaoning (CMGPD-LN) in the study of demography, stratification, and social and family history. It will also preview a new dataset, the China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC), that we plan to release in 2013. HKUST Distinguished Professor and Dean of Humanities and Social Sciences James Lee advises on the organization and content of the course. UCLA Professor of Sociology Cameron Campbell will lecture. If any non-Chinese speakers enroll, the lectures will be in English; otherwise lectures may be in Chinese.

These datasets are complex in many ways: longitudinal, multi-generational, and structured at multiple levels, including the individual, the household, the kin group, the community, the administrative unit, and the region.  Fully exploiting the potential offered by these data requires application of sophisticated techniques in STATA or other statistical packages to manage the data, create variables, and carry out analysis.

This class is intended to introduce students to advanced techniques required to manage and analyse the CMGPD datasets, thereby equipping them to make use of the CMGPD-LN and CMGPD-SC in their own research.


China Multigenerational Panel Dataset – Liaoning (CMGPD-LN)

The CMGPD-LN is an important dataset for the study of China’s family, social and demographic history, and for the study of demography and stratification more generally. The dataset is suitable for application of a wide variety of statistical techniques that are commonly used in social demography for the analysis of longitudinal, individual-level data, and available in the most popular statistical software packages. The dataset is distinguished by its size, temporal depth, and richness of detail on family, household and kinship context.

The materials from which the dataset was constructed are Shengjing Imperial Household Agency household registers held in the Liaoning Provincial Archives. The registers are triennial. Altogether there are 3,600 of them. We transcribed a subset of them to produce the CMGPD-LN, which spans 160 years from 1749 to 1909. At present, the dataset comprises 29 register series, and consists of 1,500,000 records that describe 260,000 individuals over seven generations. The CMGPD-LN is accordingly an important resource for the study of historical demography, sociology, economics, and other fields.

The CMGPD-LN and associated English-language documentation are already available for download at ICPSR, following a free registration. Please visit the website:

China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC)

The CMGPD-SC covers communities of recent settlers in Shuangcheng, Heilongjiang in the last half of the nineteenth century and beginning of the twentieth. It contains 1.35 million records that describe 100,000 people. The registers cover descendants of urban migrants from Beijing and rural migrants from neighboring areas in northeast China who came to the area in the first half of the nineteenth century as part of a government organized effort to settle this largely vacant frontier region. One of the distinguishing features of this dataset is the availability of linked, individual-level landholding records for several points in time. The data also include a rich array of other indicators of household and family context and socioeconomic status. We anticipate formal public release of the dataset via ICPSR in 2013 or 2014. We will provide participants in the summer class with access to drafts of the release and documentation.

Topics to be Covered in Class

1. Review of relevant research on related topics in social demography
2. Results on topics in social and family demography from the CMGPD-LN
3. Advanced techniques in STATA for the management and analysis of the CMGPD-LN data
4. Preview of the CMGPD-SC
Dates
July 6, 2012 to July 20, 2012
Location
Shanghai Jiaotong University School of Humanities (SJTU Minhang Campus, Shanghai)
Application deadline
April 25, 2012 (see link below for application)
Application procedure
Please send your personal statement and application form as attachments to  We will have an English language application form available soon.
Applications from faculty and graduate students are welcome.  Applications from undergraduates may be considered if they have already been accepted into a graduate program beginning fall 2012.  Students should already be able to conduct basic operations in STATA, and should also have completed a basic course in linear regression.

We anticipate being able to accommodate 25 students. 
Students will be offered free housing in dormitories at SJTU. Students who want other accommodations will have to arrange them on their own and pay for them. Students should bring their own computer, with STATA or another statistical package already installed. Students already familiar with other statistical packages may use them, but we will only be able to provide support to students using STATA. Students are responsible for travel and local expenses.

Chinese language announcement of our 2012 CMGPD summer course at SJTU

We’ve pretty much finalized the text of our announcement for our summer short course at SJTU this July.  I’ll produce an English language version pretty soon and post it.  This summer I will probably lecture in English so we encourage applications from non-Chinese speakers.               


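The Shanghai Jiaotong University Center for the History and Society of Northeast China is a research unit established jointly by the Shanghai Jiaotong University School of Humanities and the Hong Kong University of Science and Technology School of Humanities and Social Science. The Center’s second summer school will run from July 6 to July 20, 2012. The course centers on social demography and the use of the China Multi-Generational Panel Dataset – Liaoning. Professor James Lee (李中清), Dean of the School of Humanities and Social Science at the Hong Kong University of Science and Technology, serves as course adviser, and Professor Cameron Campbell (康文林) of the Department of Sociology at the University of California, Los Angeles, will be the main lecturer. All instruction will be in English.

The China Multi-Generational Panel Dataset – Liaoning (清代辽宁多代人口数据库) is an important dataset for the study of China’s family and social demographic history. It can also provide extensive data support for research on demographic behavior, kinship, and the processes of social stratification. The dataset is suitable for basic statistical analysis in any social science statistical package.


The China Multi-Generational Panel Dataset – Shuangcheng (中国清代双城多代人口数据库) is a population database for the Qing and early Republican periods built by the research group of James Lee and Cameron Campbell. It contains nearly 1.35 million records on more than 100,000 people. The source materials are the Qing banner household registers of Shuangchengbao under the Jilin General. They track in detail the population and household landholdings of three categories of bannermen: metropolitan bannermen (京旗), tunding (屯丁), and fuding (浮丁). They provide continuous, dynamic demographic and socioeconomic information, and offer valuable material for research in history, demography, economics, and other fields. To promote scholarly exchange, the database will be opened to the world in the near future.
Course dates: July 6 to July 20, 2012
Location: School of Humanities, Shanghai Jiaotong University
Application deadline: April 25, 2012 (please attach a personal statement and application form)
Conditions: free student dormitory housing and course materials are provided; please bring your own computer (推
How to apply: please send your personal statement and application form to
Eligibility: priority will be given to applicants who are familiar with STATA or other statistical software and with methods such as regression analysis.
Places available: 25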


Announcement of 2012 CMGPD-LN Summer Course at SJTU

We’ve begun making our detailed plans for the 2012 CMGPD-LN Summer Course at Shanghai Jiaotong University. The Chinese-language announcement is available at our SJTU Center website, via this link:  The course will run from July 6 to July 20. Since there may be non-Chinese speaking participants this year, I will probably lecture in English. The goal of the course is to introduce participants to management and analysis of the CMGPD-LN data, with special attention to using STATA to transform the data and create new variables as needed for different analyses.

A modest proposal for facilitating data-driven choices of college and major

I recently came across this article about a study at Georgetown looking at the employment prospects and average incomes associated with various majors.  Here is the page at Georgetown devoted to the study itself.

My own specialty isn’t higher education, but I’ve been thinking about it much more recently for a variety of reasons.  One is my involvement in a study of long term trends in the social class origins of students at elite Chinese universities led by my longtime collaborator James Lee, and which includes a large number of outstanding collaborators, of whom I am but one.  For that study, we have been looking for comparison points internationally.  I have been struck by the fact that with the remarkable exception of the University of California, which offers student data in aggregate form at its Statfinder website, there is very little in the way of systematic and comparable aggregated data on student characteristics and outcomes from institutions of higher education, reflecting what I increasingly see as a troubling and probably deliberate lack of transparency.  More importantly, we have been talking about how to follow up this initial study by looking at outcomes for graduates, inspired by the Harvard and Beyond study.  I should add that my outrageous proposal below for a nationwide data collection system on student characteristics and outcomes is but a pipe dream, and has little to do with anything we are hoping to do in our own studies.  Another reason for my interest is simply a recent uptick in the numbers of conversations with colleagues here and elsewhere about what we can do to improve undergraduate education to make it more engaging and rewarding for students.

The conclusions summarized in the Georgetown study article about the differences between the various majors are pretty much as I would expect.  I haven’t looked at the study in enough detail to comment on its methodology or its data, but I applaud the general idea of collecting and analyzing the data on the socioeconomic outcomes of different majors, and for that matter, different types of colleges.  In this day and age, there is really little excuse for the choice of college and major not to be more data driven.

Right now, too many students fly blind when they choose a college and a major.  They make choices about college based on somewhat relevant criteria like the overall academic reputation of the school, as well as largely irrelevant criteria like the physical appearance of the campus, its geographic location, the success of its sports teams, what they’ve heard from friends, or reviews at various websites.  While the choice of major may be very personal, it may also be based on very limited information that students may have acquired in high school or in their first year in college, and may also reflect undue optimism about the prospects after graduation.

Choosing a college and major on the basis of such limited and sometimes uninformative criteria reflects the general lack of easily accessible data on the outcomes of students from different colleges and majors. Studies like the one done at Georgetown are relatively uncommon, and typically have limitations that reduce their usefulness for planning. For example, as comprehensive as the Georgetown study is, its reliance on the American Community Survey precludes comparisons of salaries and employment according to the types of institutions that students attended, or their own academic qualifications.

Although other published studies by economists have sought to quantify the rewards associated with different majors after accounting for the prestige of the institution and the qualifications of the students, these are also limited in terms of their usefulness for student planning because they typically don’t identify specific institutions or specific majors. Such studies typically rely on data from panel surveys that don’t have enough respondents to drill down to specific combinations of institution and major. Even if such detail were available, the surveys haven’t been in place long enough to look at earnings and employment over the entire career. Usually they will only have information on outcomes a few years out of college.

In an ideal world, high school seniors deciding which schools to apply to, or which school to attend, would be able to visit a website that would let them see what employment and earning outcomes were like for students with academic qualifications like theirs who graduated from a specified institution and major. They would be able to enter their SAT or other standardized test scores, their GPA, perhaps some information about their high school, and the name of a prospective institution and major, and see how students who resembled them were doing 1, 5, 10, and 20 years after graduation.

At least in principle, this should be possible by linkage of various administrative databases and creation of a student tracking system similar to the ones that many states or school districts are already putting in place for K-12 education. The federal government would need to make use of the power it has as the key source of funds for student grants and loans, and faculty research, to demand that academic institutions that receive federal funds participate in a national tracking system that would follow students from senior year in high school through college and into the labor market. Compliance would involve providing detailed data on applicants, acceptances, and matriculating students, including their academic qualifications as applicants and their subsequent performance in college. These data would be collected and held in a secure site such as already exists for various forms of administrative data, and could be linked to administrative data on subsequent earnings of graduates from Social Security or various state agencies.

The resulting linked database would allow for a student contemplating a particular combination of college and major to see what prospects were like for someone like themselves.  In many cases, it would help clarify the potential consequences of different choices.  By providing an empirical basis for making important choices, it would probably decrease the influence of less relevant and useful information such as the overall reputation of institutions and majors.  In many cases, I suspect it would help level the playing field between public and private schools and between elite and non-elite schools by confirming in a very convincing way that students who seek to maximize income are generally better off pursuing engineering at a state school than pursuing a liberal arts major at a private university.  There are already academic studies that suggest this, but students need to see results for specific institutions and majors.

What I have in mind is something like the Consumer Reports Used Car Guide where different makes of car from different model years are rated on a variety of criteria based on surveys of owners.  Except in this case, a student could type in their SAT score, their high school GPA, and some other information, and a list of institutions and majors, and get back out some kind of assessment of the average incomes and employment rates of students like themselves at different points in time after graduation.

The suggestion that students should explicitly consider employment prospects and income when choosing institutions and majors may sound cold-blooded and crass, but I would argue that the information should at least be available, and considered alongside whatever information students have available to them.  While many very admirable students have the combination of passion and financial wherewithal to pursue an esoteric major at an expensive private university without worrying about going into debt, the reality is that right now too many students go deeply into debt pursuing degrees that will do nothing for them after they graduate, at expensive institutions of dubious quality.  If they had made their choice based on complete information about the likely prospects of someone with their qualifications who attended that institution and pursued that major, it would be their fault. But too often students choose institutions and majors that do nothing for them because they don’t really have enough useful information available to them, and they have to rely on fundamentally uninformative or irrelevant factors like the reputation of the institution, or some very limited exposure to a particular field in high school or early in their college career.

To go even further out on a limb, I would like to see such information about institutions and majors used in making decisions about student grants and loans. Perhaps it is already, but I don’t know enough about how the system works. A student who wants to study engineering at a state school should receive more support in the form of grants and loans than a student who wants to study something less practical at an expensive private institution. If they do receive loans, the limits should be much higher and the interest rates much lower. Essentially, public investments in individual education in the form of grants or loans should be made according to the same principles as loans in general are made, in the sense that the loan amount and interest rate should be based on the likelihood of it being paid back. The recent efforts to rein in student loans at for-profit colleges seem like a step in the right direction, in terms of making the allocations data driven, but there is no reason that this principle shouldn’t be extended.

I suspect that making the choice of college and major more data-driven and focused on results for graduates would pressure colleges to redirect their attention away from investments in fancy buildings, star faculty, and sports facilities and emphasize investments that increase the ‘value-added’ of undergraduate majors.  In an ideal world, it would lead to a reorganization of the undergraduate experience where there was more emphasis on the overall design of a major and thought given to the intended ‘product’ and less of the unsavory horse trading that seeks to ensure that the courses that faculty enjoyed teaching were listed as requirements.

Personally, I would like to see a much smaller number of majors, each focused on a recognized discipline, and each with its own distinct theoretical framework, evidentiary basis, and set of methods. I’m not arguing for turning college into vocational training, rather that majors have more internal consistency and coherence in terms of theory, substance, and method so that graduates are ‘branded’. This already is the case in engineering and the natural and life sciences, where the content of a physics, chemistry, engineering or biology major is broadly similar across different institutions, but not at all the case in the behavioral or social sciences, or the humanities. I’ll get into this issue with specific reference to the social sciences in another blog post, but the point remains that as far as I can tell, many humanities and social science majors do not reflect much evidence of a guiding intellect in their design, and at any given institution seem to reflect a path-dependent process of addition or deletion of requirements and electives according to the configuration of faculty interests.

Of course, I realize this proposal for large scale collection of longitudinal data on all college bound students from senior year in high school into middle adulthood is wildly unrealistic, most importantly because colleges would object to it.  One thing I have noticed is that colleges don’t seem to like transparency with respect to the characteristics or outcomes of their students that would facilitate comparison shopping based on overall outcomes.  They prefer to control information and report on positive outcomes like successful alumni, and then compete with other institutions on intangibles like reputation.  To the extent that they provide information, it is for the increasingly common and silly college ranking exercises, and that information is generally provided in aggregate form that is easy to manipulate.

With regard to transparency, I would like to give a shout out to my employer, the University of California, which at least provides detailed tabular data on the characteristics of students at their remarkable website:  This is where we should be headed in terms of provision of information to support decision-making.  Visitors at this amazing site can tabulate students according to the socioeconomic profile of their families, ethnicity, geographic origin, and any number of other variables.  They can also look up persistence rates, GPA, and graduation rates by class.  Basically, what we need is something like the University of California Statfinder for ALL institutions of higher education combined, and with additional information about student outcomes after graduation.

Revising the syllabus for my Chinese society class (Sociology 181B)

I have started reworking my syllabus for my upper division Chinese society class (Sociology 181B) which I will be teaching again this spring, after a bit of a hiatus.  I have an exciting opportunity to redo the design of the class from scratch.  After C.K. Lee joined the department here, we decided to take advantage of the complementarity of our research interests to turn what had been a one-quarter course that covered everything under the sun and inevitably was a mile wide and an inch deep into a comprehensive two-quarter sequence.  In the past, when I taught the course, inevitably I emphasized topics like family, population, and inequality because they reflected my own interests.  I tried to cover social movements, politics, labor, and other topics, but I’ll be the first to admit I couldn’t really do them justice.

Now that we have a two-quarter sequence, I can devote the entire quarter to my own areas of expertise, specifically family, population, and stratification.  I have rearranged the schedule accordingly, giving entire lectures to topics that in the past I dispensed with in one-third of a lecture.  I’m also taking the opportunity to overhaul the readings since there is so much new scholarship in the last few years.  Of course, the real problem is finding readings that address the most recent social phenomena, that have not yet been subject to scholarly studies, or aren’t even amenable to the sorts of quantitative analysis that I am used to.  I’ll be poking around over the next few weeks.

For the benefit of students who are already looking around for courses for spring quarter, here is a link to the tentative syllabus:

The readings are going to change substantially, but the schedule itself should provide a pretty good idea of what topics I will cover and how much attention each will receive.

The inevitable challenge is teaching a course that introduces Chinese society, but is also sociological, in the sense of being embedded in the broader questions that are of concern to the discipline. It would be easy to teach a Chinese society class that would be a ten-week version of the country introduction in a tour guide, and a series of sensational or at least journalistic stories and anecdotes about contemporary Chinese society. I could teach a course like that and it would probably be lots of fun for everyone, but it would be a disservice to the students. My approach has been to embed my discussions into broader themes related to East/West comparison, demographic theory, stratification, and so forth. But it’s an ongoing effort.

Following my usual practice, I’ll also have recommended reading that is not necessarily scholarly, but vividly illustrates many of the issues covered in the lectures and formal reading.  For quite some time I required Qiu Xiaolong’s excellent Death of a Red Heroine as a sort of companion to the reading on contemporary urban China, but this time around I will try having the students read Peter Hessler’s excellent Country Driving, since that covers such a wide swath of contemporary Chinese society.

As I’ve been thinking about my reorganization of the class, I’ve also been reflecting on how fortunate my department is to have so much depth in Chinese studies. Most of the major sociology departments in the United States have only one person whose primary research focus is China. Obviously there are prominent exceptions like Stanford, which has Zhou Xueguang and Andy Walder. Here at UCLA, however, we have three colleagues who work primarily on China, or have at least one major ongoing research project in China: C.K. Lee, Min Zhou, and myself. And of course our emeritus colleague Don Treiman remains active with various projects in China and elsewhere. I hope that in the future, this becomes the norm as opposed to the exception, and it becomes typical for departments to have multiple colleagues carrying out research on Chinese society. One can only hope.

Our paper on trends in the social origins of students at elite Chinese universities

Our paper on the long-term social origins of students at Peking University and Suzhou University has appeared in China Social Science (中国社会科学). The paper’s title is “无声的革命:北京大学与苏州大学学生社会来源研究 1952-2002 (Silent Revolution: Research on the Social Origins of Peking University and Suzhou University Students, 1952-2002).” The lead authors were James Lee/李中清 (HKUST) and LIANG Chen/梁晨 (Nanjing University) and there were six additional co-authors, including myself.

My own role was fairly small, and limited largely to advising on the statistical analysis, and participating in discussions of the implications of the results. But it is an important paper, and I would rather make a minor contribution to an important paper than make a major contribution to an unimportant one. I already do a lot of the latter.

Here is the announcement of the issue that includes the paper at the China Social Science website:

Here is a place at the China Social Science website where you can view a complete abstract and download the article:

The paper presents many novel empirical findings on trends in the social origins of the students at these two universities. In my mind, the most important is the demonstration that during the period covered by the analysis, the percentage of students from farming and working class origins was much higher than at national and regional elite universities in the US.

Perhaps the only elite schools in the US in which students from modest socioeconomic origins are so well represented are the University of California campuses, including UCLA. I was just at a meeting yesterday where some basic tabulations were presented on the socioeconomic characteristics of entering freshmen at UCLA and I was pleased to see that we continue to admit and enroll large numbers of students who are first-generation college students, or from families of relatively low socioeconomic status. Based on what I have seen in tabulations from the annual Freshman Survey carried out by the Higher Education Research Institute here at UCLA, in the United States the most selective privates admit a large share of their students from high income families. Only a small portion come from modest origins.

If you can lay your hands on a copy of 中国社会科学, the full reference is

梁晨 (LIANG Chen), 张浩 (ZHANG Hao), 李兰 (LI Lan), 阮丹青 (RUAN Danching), 康文林 (Cameron Campbell), 杨善华 (YANG Shanhua), 李中清 (James Lee). 2012. “无声的革命:北京大学与苏州大学学生社会来源研究 (1952-2002) (Silent Revolution: Research on the Social Origins of Students at Peking University and Suzhou University, 1952-2002).” 中国社会科学 (Chinese Social Science). 2012(1):98-118.
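Here is the abstract, translated from the Chinese:

Since 1949, a revolution has taken place in Chinese higher education. The sources of students in elite higher education began to diversify. The former monopoly held by children of the upper social strata was broken, and children of workers, farmers, and other lower social strata gradually came to account for a considerable share, one that was successfully maintained through the end of the twentieth century. Institutional arrangements including the expansion of basic education, the establishment of the unified national college entrance examination and admissions system, and the creation of key secondary schools together drove the emergence of this silent revolution. Although this revolution was less conspicuous than social and political revolutions, it was equally far-reaching. Using detailed material from the student registration cards of Peking University and Suzhou University for 1952 to 2002, this study seeks to present this revolution and its achievements, and to offer a reference for the reform and development of Chinese higher education.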

Because much of the online discussion of our article has focused on what appears to be an increase in the share of students whose father and/or mother are cadres, James Lee and Liang Chen have provided some additional details on this trend to help clarify some key underlying features.  Below, I have added this material to this blog entry, on 3/26/2012.  We are preparing additional materials to help ‘unpack’ the findings in the article and clarify some of the key trends.
Additional points re the increase in the proportion of students whose father and/or mother was a cadre (from James Lee and LIANG Chen)

Recently there has been considerable interest in our research finding that the proportion of cadre children at PKU increased during the last quarter of the twentieth century from 11 percent in 1976 to 38 percent in 1999.

This finding, which was published in《无声的革命:北京大学与苏州大学学生社会来源研究(1952-2002)》中国社会科学杂志 2012 年 1 期, is based on an analysis of the social origins of some 150,000 undergraduate students who entered Peking University and Suzhou University in the last half of the twentieth century.

The article also shows several other important discoveries.

1. Based on the analysis of Suzhou University undergraduates, while the overall proportion of cadre children similarly increased, the proportion of cadre children who are from explicitly political cadre families in fact declines from 85 percent in 1965 to less than 45 percent in 1999.

2. The proportion of Suzhou University cadre children who are from commercial enterprise cadre families, however, increases from 3.4 percent in 1976 to over 43 percent in 2001.

3. At the same time, the proportion of children of factory workers also increases from 13 percent in 1992 to 22.4 percent in 1999 at Peking University and from 11.4 percent in 1989 to 24.4 percent in 2001 at Suzhou University.

In fact, overall the proportion of children from blue collar families remains roughly stable at Peking University during the last quarter of the twentieth century and increases during this period at Suzhou University.

Overall by international standards, Chinese elite university admissions as demonstrated by these two universities were and continue to be remarkably open to children from non-elite families.


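Recently, one finding of our research, that the share of cadre children among Peking University students rose from nearly 11% in 1976 to nearly 38% in 1999, has drawn wide attention and sustained discussion across society. In fact, this is only one of the findings from our study of the social origins of roughly 150,000 undergraduates admitted to Peking University and Suzhou University in the second half of the last century. The study, titled 《无声的革命:北京大学与苏州大学学生社会来源研究(1952-2002)》, was published in 中国社会科学, 2012, issue 1. Our research has at least three other important findings worth noting:

1. As at Peking University, the share of cadre children among Suzhou University students has grown steadily since the reform and opening-up, but within the cadre group the share of party and government cadres fell from 85% in 1965 to 40% in 2001.
2. In contrast, the share of enterprise cadre children among Suzhou University cadre children rose from a low of 3.4% in 1976 to 43% in 2001, overtaking party and government cadres as the largest source of cadre children.
3. At the same time, the share of workers’ children rose markedly at both universities. At Peking University it rose from 13% in 1987 to 22.4% in 1998; at Suzhou University it rose from 11.4% in 1989 to 24.4% in 2001.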



Fertility rates using the births in last year variable from the ACS in IPUMS

[This is another note on using the SDA interface to analyze IPUMS that is intended for students in my Introduction to Social Demography.  I am posting it here rather than my class website because it may be of interest to others who are using the IPUMS for teaching.]

The ACS includes some very useful questions on demographic events within the last year, including births, marriages, and divorces within the last year.  Many students have indicated an interest in studying birth rates, so I am writing this note to provide some help on using the ACS data from 2001 to the present to calculate basic rates.

The ACS data on IPUMS includes a variable that indicates whether someone has had a birth in the last year.  It is 0 for cases where the information is not available, 1 if no birth occurred, and 2 if a birth occurred.  We can use this to approximate fertility rates if we restrict (using the filter) to observations where fertyr was 1 or 2, use comparison of means, and remember to subtract 1 from the means that appear in the table.  We need to subtract because 1 indicates no births, while 2 indicates a birth.
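
For anyone who wants to check the arithmetic outside SDA, here is a minimal Python sketch of why the subtraction works. The eight fertyr values are made up for illustration, and the sketch ignores person weights, which is a simplification of what a real ACS tabulation needs.

```python
from statistics import mean

# Hypothetical fertyr values for eight women, using the coding described above:
# 1 = no birth in the last year, 2 = a birth in the last year.
# Cases coded 0 (information not available) are assumed to be filtered out already.
fertyr = [1, 1, 2, 1, 1, 1, 2, 1]

raw_mean = mean(fertyr)                 # what a comparison of means reports
birth_share = raw_mean - 1              # proportion of women with a birth
rate_per_thousand = birth_share * 1000  # births per thousand women

print(raw_mean, birth_share, rate_per_thousand)
```

Because every woman contributes either 1 or 2, the mean is always 1 plus the share of women coded 2, which is why subtracting 1 recovers the proportion with a birth.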

To calculate age-specific rates by year from 2001 to 2009, I set up a calculation with the following…

Dependent variable: fertyr
Row: age(r:10-14;15-19;20-24;25-29;30-34;35-39;40-44;45-49)
Column: year
Selection filter: fertyr(1-2) sex(2) age(10-49)

Also, under ‘Change number of decimal places to display’, I selected 3, so that whatever the mean was, subtracting 1 and multiplying by 1000 would yield a rate per thousand.

Here is the output.

To address the problem associated with fertyr being 1 for people who haven’t had a birth, and 2 for people who have had a birth, we could recode fertyr so that 0 means no births, and 1 means a birth.  In that case, the mean would actually be the proportion of people who have had a birth:

Dependent variable: fertyr(r:0=1;1=2)

Of course, instead of using year as the column variable, one could use race, or some other variable of interest.

Just remember that, unless you recode fertyr as described above, each cell of the output reports the proportion of women with a birth in the last year plus one, so when you prepare tables to turn in, subtract one from each of the values in the cells.
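If you copy the table out of SDA, the subtract-one-and-multiply step can be done in a few lines.  This is only a minimal Python sketch: the function name is my own, and the cell values are invented for illustration rather than taken from the ACS output.

```python
def fertyr_mean_to_rate(cell_mean, per=1000):
    """Convert an SDA mean of fertyr (1 = no birth, 2 = birth) to a rate.

    Subtracting 1 turns the mean into the proportion with a birth in the
    last year; multiplying by `per` expresses it per thousand women.
    """
    if not 1.0 <= cell_mean <= 2.0:
        raise ValueError("mean of fertyr restricted to 1-2 must lie in [1, 2]")
    return round((cell_mean - 1.0) * per, 1)


# Hypothetical cell means keyed by age group (invented numbers, not real output).
example_cells = {"20-24": 1.100, "25-29": 1.125}

rates = {age: fertyr_mean_to_rate(m) for age, m in example_cells.items()}
print(rates)
```

The range check is there because a mean outside 1 to 2 usually means the fertyr(1-2) filter was left out.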

Comparing birth cohorts instead of time periods in the IPUMS

[Another note intended for students in my Introduction to Social Demography class who are using IPUMS-USA for their final projects, but which may be of interest to others using IPUMS in their courses]

Many students have expressed interest in examining time trends in average age at marriage, total number of children, completed education, and other phenomena that are fixed relatively early in life.  Looking at these numbers by Census year (i.e. by making year a row or column variable) is plausible, but doing so mixes together people who came of age in various eras, unless there is some careful restriction on the ages of the people.

For example, looking at total number of children born for women in a single Census mixes together relatively young people who went through their childbearing recently, when birth rates were low, and people who went through childbearing earlier, when birth rates were high.  This makes comparison across Census years problematic.  Similar problems exist for average age at marriage, and so forth.  One approach is to use a filter to limit the women included in a comparison to a narrow age range that is easy to compare across censuses.

Another approach, however, is to use a recoded variable for year of birth as a row or column variable, and thereby compare men or women according to the era in which they were born.  This is fairly straightforward.

As an example, to look at trends in average age at marriage in successive birth cohorts, I set up a comparison of means calculation.  I set the dependent variable to agemarr, the row variable to birthyr(r:1890-1899;1900-1909;1910-1919;1920-1929;1930-1939;1940-1949), the column variable to sex, and the filter to agemarr(1-99) age(40-50) birthyr(1890-1940).  I restricted to age 40-50 so that the calculation would only include people who had had an opportunity to marry.  Birthyr is limited to 1940 because agemarr is available only through 1980.  The result was the following:

If I wanted to do this by race, I could have set the column variable to race, and the control variable to sex.

If I wanted a graph of the average ages for people born in individual years, I could specify the row variable as birthyr with no recode, check ‘Suppress table’ further down (since the table would have 50 rows, one for each year), and then under ‘Type of chart’ choose line chart.  The result is the following:

Of course, you could just as easily do this by race, or education, or something else.

As another example, I redid the calculation to look at mean number of children ever born (chborn).  I set the dependent variable to chborn, the row to birthyr(r:1850-1859;1860-1869;1870-1879;1880-1889;1890-1899;1900-1909;1910-1919;1920-1929;1930-1939;1940-1949), and the filter to age(50-80) birthyr(1850-1940) chborn(1-*).  The restriction of chborn to 1 and higher reflects the fact that chborn is 0 for people for whom the information is not available, and 1 + the number of children for everyone else.  Thus chborn being 1 means 0 children, chborn being 2 means 1 child, and so on.  The filter for age being 50-80 restricts to women who are at least age 50, and have therefore completed their childbearing.  In interpreting the results below, remember that the mean of chborn is one higher than the actual mean number of children.  To get the mean number of children, you need to subtract one.

You may want to at least consider this approach for any outcome that is fixed relatively early in life, and may vary a lot according to the era in which someone grew up.  Educational attainment would be another logical choice.
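For anyone working with an extract rather than the SDA interface, the cohort comparison can be sketched in Python as follows.  This is an illustration under my own assumptions: the function names are hypothetical, the records are invented, and the means are unweighted, whereas a real tabulation would normally apply the person weights.

```python
from collections import defaultdict


def birth_decade(birthyr):
    """Label the ten-year birth cohort, e.g. 1893 -> '1890-1899'."""
    start = (birthyr // 10) * 10
    return f"{start}-{start + 9}"


def mean_children_by_cohort(records):
    """Mean children ever born by birth cohort.

    Each record is (birthyr, chborn) using the coding described above:
    chborn 0 = not available, otherwise 1 + number of children.
    """
    totals = defaultdict(lambda: [0, 0])  # cohort -> [sum of children, count]
    for birthyr, chborn in records:
        if chborn < 1:          # same role as the chborn(1-*) filter
            continue
        cell = totals[birth_decade(birthyr)]
        cell[0] += chborn - 1   # undo the plus-one coding
        cell[1] += 1
    return {cohort: s / n for cohort, (s, n) in sorted(totals.items())}


# Invented records for illustration only.
toy = [(1893, 4), (1897, 2), (1901, 1), (1905, 3), (1908, 0)]
print(mean_children_by_cohort(toy))
```

Subtracting one inside the loop means the reported means need no further adjustment, unlike the SDA output.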

Using ethnicity/nativity variables in IPUMS to identify 1st/2nd/3rd+ generation

[These are some notes intended for students in my undergraduate Introduction to Social Demography class, for use in working on their final projects, but I thought they would be of wider interest to others using IPUMS in their teaching.]
Many students are interested in doing detailed comparisons of the social and demographic characteristics of specific ethnic groups.  In reviewing the project proposals, I saw that many students had used, or were planning to use, the detailed codes for the RACE variable.  I would strongly encourage everyone who is interested in a specific ethnic group to assess whether some of the other available variables like the self-reported ethnicity variables (ANCESTR1 and ANCESTR2) available in 1980, 1990, and 2000 might offer more cases and better resolution.

If you want to distinguish between 1st generation and later generations, you can filter on BPL, as described below.  Through 1970, a variable for father’s birthplace, FBPL, is also available, and as described below, can be used to identify the second generation.

·         First-generation immigrants (Born abroad)
o   Filter based on the birthplace variable, BPL.  Codes for BPL identify the state or country of birth of the respondent.  If you want to restrict to people born in a particular country, look up the code for that country, and use that code in the filter.  For example, people born in Sweden are identified by the detailed BPL code 40500 (405 in the general codes), so in the filter field on the screen for specifying your tabulation you would enter bpl(40500) along with whatever filters are relevant to your calculation.
o   BPL codes:  Make sure to choose ‘Detailed’ rather than ‘General’ so that you see the 4 or 5 digit codes that you will use in your filters.  If you use the 3 digit codes listed under the ‘General’ view, the filter will not work properly.
o   With the detailed codes that you will use in your filter, there may be multiple codes corresponding to the same country, because there are different codes for regions in the same country, especially if during the nineteenth century, these regions were separate countries.  For example, Canada is 15000-15083, Germany is 45300-45362.
·         The second generation (born in the U.S. to a parent who was born abroad)
o   You can identify people with a parent born abroad by use of the Father’s Birthplace (FBPL) or Mother’s Birthplace variable (MBPL).  To make things consistent, please base your definition of the ‘second generation’ on the father’s birthplace (FBPL). 
o   To ensure that you are considering individuals born in the United States to a father or mother who was born abroad, combine a filter based on father’s birthplace being in the country of interest (FBPL) with a filter based on own birthplace (BPL) being the United States.
o   For example, to limit your tabulation to records of second-generation Swedish-Americans, that is people born here to Swedish fathers, you would include bpl(100-12092) fbpl(40500) in your filter.
o   With the detailed codes that you will use in your filter, there may be multiple codes corresponding to the same country, because there are different codes for regions in the same country, especially if during the nineteenth century, these regions were separate countries.  For example, Canada is 15000-15083, Germany is 45300-45362.
o   In the dataset that is available for online analysis, fbpl is only available through 1970.  It isn’t provided in the 1980, 1990, and 2000 data that are available online.   So your tabulations involving fbpl will normally end in 1970.
·         Second and later generations, 1980-2000 (born in the U.S., but claiming an ethnicity)
o    In 1980, 1990, and 2000, the Census form included a question about ethnicity.  The responses are in the variables ANCESTR1 (for the first response) and ANCESTR2 (for the second response).  The responses, as we know from the article by Hout and Goldstein on the Irish-American population, are highly subjective.  Nevertheless, they do allow you to get a picture of an ethnicity that includes more than just the first or second generation.
o   Identify the second and later generations as people who specified your ethnicity of interest in ANCESTR1, but who also indicated that they were born in the U.S.
o   ANCESTR1 codes (note that values differ from the BPL and FBPL codes):  Make sure to choose the ‘Detailed’ view rather than the ‘General’ view so that you can see the three digit codes you will need for your filter.
o   For example, to limit your tabulations to records of second- and later-generation Swedish-Americans 1980-2000, you would add bpl(100-12092) ancestr1(890-900) to your filter.
o   Do not include fbpl in your filter here because in the dataset that is available for online analysis, fbpl is not included in 1980, 1990, and 2000.  If you filter on both ancestr1 and fbpl you will end up with no cases.
o   Note that because responses on the ancestry question were open-ended, and people sometimes responded with a region in a particular country rather than the country itself, to ensure you get everyone associated with a particular country, you may need to specify a range of codes.  For example, Italian includes all the codes from 510 (Italian) to 730 (Venetian), so to pick up all the people who might plausibly be claimed to be Italian, you would specify ancestr1(510-730) in your filter.
·         Native-born population of the U.S.
o   Include bpl(100-12092) in your filter.
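The rules in the list above can be written out as a small classification function.  This is a rough Python sketch rather than anything built into IPUMS: the function names are my own, the code ranges are the ones quoted above, and 3600 is used only as an arbitrary code inside the U.S. range.

```python
US_BPL = range(100, 12093)  # U.S. birthplaces, per the filter bpl(100-12092)


def in_codes(value, codes):
    """True if value falls in any inclusive (low, high) range in codes."""
    return any(low <= value <= high for low, high in codes)


def classify_generation(bpl, fbpl, ancestr1, country_bpl, country_ancestry):
    """Classify one person relative to a country of interest.

    country_bpl / country_ancestry are lists of inclusive (low, high) code
    ranges.  Pass fbpl=None for 1980 and later, where FBPL is unavailable,
    and ancestr1=None for years without the ancestry question.
    """
    if in_codes(bpl, country_bpl):
        return "first generation"
    if bpl in US_BPL:
        if fbpl is not None and in_codes(fbpl, country_bpl):
            return "second generation"
        if ancestr1 is not None and in_codes(ancestr1, country_ancestry):
            return "second or later generation (self-reported ancestry)"
    return "other"


# Sweden, using the codes quoted in the notes above.
SWEDEN_BPL = [(40500, 40500)]
SWEDEN_ANCESTRY = [(890, 900)]

print(classify_generation(40500, None, None, SWEDEN_BPL, SWEDEN_ANCESTRY))
print(classify_generation(3600, 40500, None, SWEDEN_BPL, SWEDEN_ANCESTRY))
print(classify_generation(3600, None, 890, SWEDEN_BPL, SWEDEN_ANCESTRY))
```

Passing lists of ranges rather than single codes makes it easy to handle countries such as Canada or Germany that span several detailed codes.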

Using ‘comparison of means’ to calculate proportions at IPUMS-USA

(I wrote this for the students in my undergraduate lecture course Introduction to Social Demography.  They are working with IPUMS-USA for a final project.  I thought it might be of more general interest to others who are using IPUMS-USA for teaching.)

We often want to calculate the proportion of people with some characteristic according to the values of two other variables.  The characteristic of interest might be represented by a single value of a categorical variable, or one or more values of a categorical variable, or even a range of values in a continuous variable.  We can do this with the ‘comparison of means’ tab that we use to compute the mean of income, socioeconomic index, or other continuous variables.  We just have to recode the categorical variable that we are interested in into a dichotomous variable that is 1 if the person has the characteristic we are interested in, and 0 otherwise.

For example, we might want to calculate the proportion of people who have ever been married, according to year and age group.  By ‘ever been married’, we mean anyone who is currently married, or was married in the past, but is now widowed, separated, or divorced.  In the MARST variable for marital status, that would be anyone who had values 1-5.  The remaining value, 6, corresponds to people who have never been married.

We could do a cross-tabulation in which the column variable was marital status, the row variable was age, and the control variable was year, and then add up the percentages of people in statuses 1-5 in each of the tables.  We could also recode 1-5 into one category and have the computer do the addition for us, but we would still end up with a lot of output to go through.

Alternatively, we could recode marital status into a dichotomous variable that takes on the value of 0 or 1 according to whether someone has ever been married, and then compute the mean of that new variable for different combinations of year and age group.  In the following example, I have set up a ‘comparison of means’ calculation in which the dependent variable is MARST recoded so that all values corresponding to categories where a person is currently married or was married in the past (MARST 1 through 5) are 1, and the never married are 0.  The mean of this variable will be the proportion of people who are married, or were married in the past but are now widowed, separated, or divorced.

In the following, pay particular attention to the use of recode in the specification of the dependent variable to turn marst into a dichotomous variable:

[Screenshot 1: proportion_ever_married_by_age_and_year]
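To see why the mean of a 0/1 recode is a proportion, it may help to work through a tiny example by hand.  This is a minimal Python sketch with invented MARST codes for a single cell of the table; the function name is hypothetical and the mean is unweighted.

```python
def proportion_ever_married(marst_values):
    """Mean of MARST recoded to 1 for codes 1-5 (ever married) and 0 for 6."""
    recoded = [1 if 1 <= m <= 5 else 0 for m in marst_values if 1 <= m <= 6]
    return sum(recoded) / len(recoded)


# Invented MARST codes for a single year-by-age-group cell.
cell = [1, 1, 4, 6, 6, 5, 6, 2]
print(proportion_ever_married(cell))
```

Five of the eight codes fall in 1-5, so the mean of the recoded values is the same number you would get by counting the ever-married and dividing by the total.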


Below is an example setting up a calculation to calculate proportions enrolled in school.  School enrollment is originally coded so that 1 indicates that someone is not enrolled, and 2 indicates that they are enrolled.  We recode to change 1 to a 0, and 2 to a 1, so that the mean ends up being the proportion currently enrolled.  Note that for the school enrollment variable, it only makes sense to consider people who are at the right age to be enrolled in school.

[Screenshot 2: enrollment_example]

Of course, you could do this with any number of other variables, including variables that were originally numeric or continuous.  In the example below, I have transformed POVERTY so that it is 0 or 1 according to whether the household in which an individual lives is at or below the poverty line.  POVERTY is originally coded as a three digit number that represents the household’s income as a percentage of the poverty line.  100 means that a household is at the poverty line, 001-099 means that a household is below the poverty line, and 101 up to 500 means that a household is above the poverty line.  There are no values above 500 because POVERTY is top-coded: if a household is earning more than 500% of the poverty line, it is just set to 500.  In the specification of the dependent variable, I have used the recode facility to change all values of poverty that are 101 or higher to 0, and all values of 001 to 100 to 1.  The mean of the variable is therefore the proportion of people living in poverty.  Note that the recode excludes 0 because 0 indicates that the value is not available.

[Screenshot 3: poverty_recode_example]

The value in each cell represents the proportion of individuals of the specified race in each year who are in poverty.
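The POVERTY recode described above, including the exclusion of 0, can be sketched in Python as follows.  The function names and the group labels are hypothetical, the records are invented, and the proportions are unweighted.

```python
def recode_poverty(poverty):
    """Return 1 if at or below the poverty line, 0 if above, None if N/A."""
    if poverty == 0:
        return None            # 0 means the value is not available
    return 1 if poverty <= 100 else 0


def poverty_rate_by_group(records):
    """records: iterable of (group, poverty) pairs; returns group -> proportion."""
    counts = {}
    for group, poverty in records:
        flag = recode_poverty(poverty)
        if flag is None:
            continue           # drop cases with no poverty information
        poor, total = counts.get(group, (0, 0))
        counts[group] = (poor + flag, total + 1)
    return {g: poor / total for g, (poor, total) in counts.items()}


# Invented (group, POVERTY) pairs, for illustration only.
toy = [("A", 50), ("A", 100), ("A", 101), ("A", 500), ("B", 0), ("B", 250)]
print(poverty_rate_by_group(toy))
```

Dropping the 0 cases before averaging matters: if they were recoded to 0 instead, they would count as people above the poverty line and pull the proportion down.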