Summer 2014 China Multigenerational Panel Dataset Workshop at SJTU (English announcement)

The 4th China Multigenerational Panel Dataset Workshop
Shanghai Jiaotong University, Minhang Campus
Shanghai, China

July 14-25, 2014


The Center for the History and Society of Northeast China at the Shanghai Jiaotong University School of Humanities will hold its 4th summer China Multigenerational Panel Data workshop from July 14 to July 25.

The workshop will focus on introducing the China Multigenerational Panel Datasets (CMGPD) as sources for the study of demography, stratification, and social and family history. These include the China Multigenerational Panel Dataset – Liaoning (CMGPD-LN) and the China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC).  The CMGPD have been released via the Inter-university Consortium for Political and Social Research (ICPSR).  The latest versions of the CMGPD documentation are available for download.

The CMGPD datasets have many unique features that make them useful not only for the study of Chinese population, social, and family history, but for the study of demographic, social and economic processes more generally.  Their features also make them useful as testbeds for researchers developing novel quantitative techniques.  The datasets are longitudinal, multi-generational, and structured at multiple levels, including the individual, the household, the kin group, the community, the administrative unit, and the region.
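
For readers who have not worked with panel data before, the multi-level structure described above can be pictured with a minimal Python sketch. The field names (person_id, household_id, village_id) and the four records are invented for illustration; they are not actual CMGPD variable names or data, and the workshop itself uses STATA rather than Python.

```python
from collections import defaultdict

# Toy person-period records. Field names and values are hypothetical,
# not actual CMGPD variables or data.
records = [
    {"person_id": 1, "household_id": 10, "village_id": 100, "year": 1789, "age": 30},
    {"person_id": 1, "household_id": 10, "village_id": 100, "year": 1792, "age": 33},
    {"person_id": 2, "household_id": 10, "village_id": 100, "year": 1789, "age": 5},
    {"person_id": 3, "household_id": 11, "village_id": 100, "year": 1792, "age": 41},
]


def group_by(rows, key):
    """Group rows by the value of one field, preserving input order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Longitudinal view: each individual's observations across registers.
by_person = group_by(records, "person_id")

# Higher-level views: the same rows grouped by household or by village.
by_household = group_by(records, "household_id")
by_village = group_by(records, "village_id")


def household_size(rows, household_id, year):
    """Count distinct individuals recorded in a household in a given year."""
    return len({r["person_id"] for r in rows
                if r["household_id"] == household_id and r["year"] == year})
```

The same rows serve every level of analysis: grouping by person gives a life history, while grouping by household or village gives the context in which that person was observed.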

UCLA Professor of Sociology Cameron Campbell will be the primary lecturer. Guest lecturers will include Distinguished Professor and Dean of Humanities and Social Sciences at the Hong Kong University of Science and Technology James Lee; Yuxue Ren, Professor of History at Shanghai Jiaotong University; and Dong Hao, PhD student at the Hong Kong University of Science and Technology.

This class is intended to 1) introduce researchers to the CMGPD datasets and help them decide whether they may be useful in their own studies, 2) give current users an opportunity to learn more about the origin and context of the data, and 3) give participants basic instruction in the use of STATA to describe, organize and analyze the data.   Researchers who have already started using the CMGPD-SC or CMGPD-LN are welcome to attend and take advantage of the opportunity to discuss any questions they may have with Lee, Campbell, and others who were involved in the creation of the dataset.

Lectures and discussion will focus on 1) the historical, social, economic and institutional context of the populations covered by the data, 2) key features of the data, and 3) potential applications.  There will be optional sessions to introduce the Training Guide and demonstrate basic procedures for downloading the data from the website and loading it into STATA.

Please note that while there will be basic instruction in the use of STATA to organize and analyze the data, this is not intended as a class in STATA, or introductory statistics. Students looking specifically for instruction in STATA, statistics, or data management are encouraged to look elsewhere. Again, the class is intended for participants who want to assess whether CMGPD is suitable for their research interests, or are already considering the use of the CMGPD and seek basic instruction in the use of STATA to manipulate and analyze it.

The workshop will include daily exercises to introduce key features of the data, and STATA techniques for taking advantage of these features. Participants will also complete a small project of their own design using the data and present it on the last day of the workshop.

If any non-Chinese speakers enroll, the lectures will be in English.  If the participants all speak Chinese, lectures may be in Chinese, or a mixture of English and Chinese.  Discussion will be in English and Chinese.

The Shanghai Jiaotong University Center for the History and Society of Northeast China was established as a research unit by a collaboration of the Shanghai Jiaotong University (SJTU) School of the Humanities and the Hong Kong University of Science and Technology (HKUST) School of the Humanities and Social Sciences.


China Multigenerational Panel Dataset – Liaoning (CMGPD-LN)

The CMGPD-LN is an important dataset for the study of China’s family, social and demographic history, and for the study of demography and stratification more generally. The dataset is suitable for application of a wide variety of statistical techniques that are commonly used in social demography for the analysis of longitudinal, individual-level data, and available in the most popular statistical software packages. The dataset is distinguished by its size, temporal depth, and richness of detail on family, household and kinship context.

The materials from which the dataset was constructed are Shengjing Imperial Household Agency household registers held in the Liaoning Provincial Archives. The registers are triennial. Altogether there are 3,600 of them. We transcribed a subset of them to produce the CMGPD-LN, which spans 160 years from 1749 to 1909. At present, the dataset comprises 29 register series, and consists of 1,500,000 records that describe 260,000 individuals over seven generations. The CMGPD-LN is accordingly an important resource for the study of historical demography, sociology, economics, and other fields.

The CMGPD-LN and associated English-language documentation are already available for download at ICPSR.

China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC)

The CMGPD-SC covers communities of recent settlers in Shuangcheng, Heilongjiang in the last half of the nineteenth century and beginning of the twentieth. It contains 1.35 million records that describe 100,000 people. The registers cover descendants of urban migrants from Beijing and rural migrants from neighboring areas in northeast China who came to the area in the first half of the nineteenth century as part of a government organized effort to settle this largely vacant frontier region. One of the distinguishing features of this dataset is the availability of linked, individual-level landholding records for several points in time. The data also include a rich array of other indicators of household and family context and socioeconomic status.

Pending release of the CMGPD-SC through ICPSR, the data are available for download here.


Monday, July 14, 2014 to Friday, July 25, 2014

Shanghai Jiaotong University School of Humanities (SJTU Minhang Campus, Shanghai)

Application deadline
May 1, 2014

See link below to download application

Application procedure

Please send your personal statement, curriculum vitae, and application form (English or 中文) as attachments to

Applications from faculty, postdoctoral researchers and graduate students are welcome. Applications from graduating college seniors will also be considered if they have already been accepted into a graduate program beginning fall 2014.  In that case, the application should include a copy of their graduate school acceptance. Any other interested parties should contact our staff before applying to see if they will be considered.

Participants should be able to speak or read Chinese or English.  No prior experience in statistics, demography, or Chinese history is required.  Applicants must explain the reasons for their interest in the data in their application, and should demonstrate that they have background, experience or interests that in some way are relevant.

Participants who are Chinese nationals will be provided with accommodations. Participants who are not Chinese nationals will receive assistance with arranging accommodations, and will receive a housing subsidy to help offset their costs. Participants who want other accommodations will have to arrange them on their own and will be responsible for all associated costs.

Participants should bring their own computer.

Students are responsible for all travel and local expenses, health care expenses, and other incidentals. Participants coming from abroad are strongly encouraged to confirm that their health insurance offers international coverage, or purchase travel health insurance.

Participants who are not Chinese nationals will need to obtain visas. We will issue invitation letters to facilitate the visa application. We strongly urge that accepted participants who need visas begin the application process as soon as possible after they are notified of their acceptance.

At present we expect to be able to accommodate 25-30 participants.


Required Reading

Read the following before the workshop begins.  The highest priority is the specified pages in the CMGPD-LN and CMGPD-SC User Guides.


The documentation below is available here.

  • CMGPD-LN User Guide.  English pages 1-54, 90-96 or Chinese pages 13-64, 96-101.  Skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD-SC User Guide.  English pages 1-47. Again, skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD Training Guide. Pay particular attention to the sections at the beginning that introduce the data and highlight its distinctive characteristics.

Research Articles

  • Campbell, Cameron and James Lee. 2002 (publ. 2006). “State views and local views of population: Linking and comparing genealogies and household registers in Liaoning, 1749-1909.” History and Computing. 14(1+2):9-29.
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Appendix A.
  • Campbell, Cameron and James Z. Lee. 2011. “Kinship and the Long-Term Persistence of Inequality in Liaoning, China, 1749-2005.” Chinese Sociological Review. 44(1):71-104.

Review Articles

  • 康文林 (Cameron Campbell).  2012.  “历史人口学 (Historical Demography).”  Chapter 8 in 梁在编 (Zai Liang ed.) 人口学 (Demography).   北京:人民大学出版社 (Beijing: Renmin University Press), 233-265.

Select one or two of the following research articles based on your own interests (or another published article that uses the CMGPD), and read them before the workshop starts.

  • CHEN Shuang, James Lee, and Cameron Campbell. 2010. “Wealth stratification and reproduction in Northeast China, 1866-1907.” History of the Family. 15:386-412.
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Chapter 10.
  • Wang Feng, Cameron Campbell, and James Z. Lee. 2010. “Agency, Hierarchies, and Reproduction in Northeastern China, 1789 to 1840.” Chapter 11 in Noriko Tsuya, Wang Feng, George Alter, James Z. Lee et al. Prudence and Pressure: Reproduction and Human Agency in Europe and Asia, 1700-1900. MIT Press, 287-316.
  • Chen Shuang, Cameron Campbell, and James Z. Lee.  Forthcoming.  “Categorical Inequality and Gender Difference: Marriage and Remarriage in Northeast China, 1749-1912.”  Chapter 11 in Lundh, Christer, Satomi Kurosu, et al. Similarity in Difference.


If you are not familiar with STATA, prepare for the workshop by reviewing as many of the materials for learning and using STATA at UCLA IDRE as possible. You are also strongly encouraged to watch video tutorials at the STATA website. Ideally, by the time you arrive at the workshop, you should already be able to  carry out very basic operations in STATA such as loading and saving files, creating tabulations and so forth. Do try to download the CMGPD-SC or CMGPD-LN and make sure you know how to load them and carry out very simple operations.

Recommended Reading

  • As much of the User Guides and Training Guide as you can.
  • 定宜庄, 郭松义, 李中清, 康文林. 2004. 辽东移民中的旗人社会.  上海:上海社会科学出版社.
  • Lee, James and Cameron Campbell. 1997. Fate and Fortune in Rural China: Social Organization and Population Behavior in Liaoning, 1774-1873. Cambridge University Press.
  • 李中清,王丰.  2000.  人类的四分之一: 马尔萨斯的神话与中国的现实:1700-2000。  三联·哈佛燕京学术丛书。(English: Lee, James and Wang Feng.  1999.  One Quarter of Humanity: Malthusian Mythology and Chinese Reality, 1700-2000.)
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.

Tentative Schedule (at Onedrive)


Preparation of the CMGPD-LN and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD057175-01A1 “Multi-Generation Family and Life History Panel Dataset” with funds from the American Recovery and Reinvestment Act.

Preparation of the CMGPD-SC and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD070985-01 “Multi-generational Demographic and Landholding Data: CMGPD-SC Public Release.”

The CMGPD summer workshops in Shanghai have been supported by Shanghai Jiaotong University, the School of Humanities, the Department of History, and the Center for the History and Society of Northeast China.  We are also grateful to staff at a variety of campus units at SJTU for their logistical support.

Errata from Fate and Fortune

Our book Fate and Fortune in Rural China: Social Organization and Population Behavior in Liaoning 1774-1873 appeared nearly twenty years ago. For some time, we have meant to collect and put in one place the errata that have been discovered over the years. There aren’t that many, thankfully, but it’s nice to list them all in one place. There are definitely more than are listed here, but so far, I haven’t been able to find them all in my notes. I’ll keep looking, and adding corrections as I find them. If you spot any typos in Fate and Fortune, let me know!

Page 6

In footnote 6, ‘cannabalized’ should be ‘cannibalized’. Reported by Graham Murray Campbell, April 1997.

Page 175

The title of Table 8.10 should be “Transitions in Banner occupation and organizational status for adult males, 1774-1873.” Reported by Xiangyun Wang, January 2014.

Page 263

The following, referred to in footnote 44, is missing from the references:

Pitkänen, Kari J. and James H. Mielke. 1993. “Age and sex differentials in mortality during two nineteenth century population crises.” European Journal of Population. 9:1-32.

Reported by Wang Xiangyun, February 2014.

Discussion of One Child Policy on RTHK Radio 3

Earlier this week, I participated in a panel discussion on the future of the One Child Policy on the show Backchat on RTHK Radio 3.  RTHK is public radio here in HK, and Radio 3 is their English language service.

The panelists were Stuart Basten at Oxford, Kerry Brown at the University of Sydney, Shaun Rein at China Marketing Research, and of course yours truly.  I thought the discussion was very high quality, and covered a lot of ground.

The show is available online, broken into two thirty minute halves: The first half starts about 30 minutes into the first (8:30-9:15) link, and finishes with the second link (9:15-9:30).

This was my first time in a radio studio. I was struck by how quickly the hosts could shift from their regular voice during conversation in breaks, to their ‘radio voice’ once the light came on.  And of course it is always amazing to me that anyone can make it through so many spoken sentences without an awkward pause, an “Uhhhh”, “Well…” or some other filler.

English proficiency and college admissions in China

I was interested and somewhat pleased to see that recently, there has been some discussion of lowering or eliminating the weight attached to English scores in college entrance exams in China. The Wall Street Journal China Real Time Report blog has a nice discussion of what has been happening:

Much of the commentary I have seen adopts either a practical or nationalistic interpretation. The practical arguments revolve around the idea, probably correct, that many people will in fact have no need for English after they finish college. The majority of college graduates are in fact unlikely to end up in jobs that involve contact with English-speaking foreigners, or reading English language documents. Accordingly, it doesn’t make sense to make English language ability a key criterion for entrance into college. I tend to agree with this. It doesn’t make any more sense to consider the English language ability of applicants to college in China than it does to consider the Chinese language ability of applicants to college in the United States. This is especially compelling because I think the sort of preparation that students do in order to maximize their score on the English section of the gaokao doesn’t necessarily leave them with a practical mastery of English that would be useful in routine interaction. Even worse, for the vast majority of students who in fact don’t go on to jobs after graduation that require use of English, whatever they learned for the gaokao will have been wasted.

It seems to me that it would make much more sense to eliminate the English portion of the gaokao completely, and offer intensive instruction in English only to students in majors where it will benefit them. While it may be that speaking a foreign language without an accent requires learning it before puberty, for most people, it won’t really matter if they speak English with an accent or not. Why not save intensive English instruction for top universities whose graduates are most likely to use English, or for students in second or third tier universities in majors that may lead to jobs where they are going to use English?

One way or the other, it seems like it would make sense to do some fine tuning and take an evidence based approach to deciding who would benefit the most from English language instruction, as opposed to the current one size fits all approach that pressures all students to learn enough English to do well on the gaokao, regardless of whether or not they will ever use English after they finish college.

The commentary I have seen that interprets the reduced or eliminated emphasis on English in college admissions as some kind of symbol of rising nationalism seems silly. There may or may not be increasing nationalism in China, but I am not sure that a decision to downgrade the importance of English in college admissions has anything to do with it. As I noted above, there are all sorts of practical reasons to reduce the emphasis on English. Refocusing education on a native language that people will actually use and de-emphasizing a foreign language that many will never use isn’t nationalism, it is common sense.

I think, however, that there is a third reason to think the reduced emphasis on English is a good idea: equity. I don’t have any evidence at hand to back this up, but I would hazard a guess that among the various things that one could imagine testing students on, English language ability might very well be the most heavily influenced by parental social class. I would suspect that among all of the subjects that parents could spend money on for after-school lessons or tutoring, English language performance is probably the most responsive. Again, I don’t have any evidence to support this, but I suspect that progress in English, like progress in any foreign language, is fastest when students have the sorts of skilled teachers and intensive instruction that are available to families with money. And of course, children who have upper middle class parents who already speak and read English, or who have the money to send their children abroad during the summer, will be especially advantaged.

Leaving the issues of after school instruction and parental English language ability aside, I would guess that urban/rural and school differentials in the quality of English language instruction are much more extreme than differentials in the quality of instruction in math, Chinese, science, or other subjects. Schools may not only differ in whether they offer any English at all, but in what kind of teachers they can hire. Rural elementary schools may not even offer English language instruction, while elite primary schools in major Chinese cities may very well have native speakers with teaching credentials teaching students from an early grade. The depiction of elite middle and high schools in Beijing in this Washington Post article is especially suggestive of the sorts of growing gaps in the quality of instruction across schools. Conversely, we were recently in the countryside and happened to visit an elementary school in a village, and the principal told us that there was no English instruction there.

Thus for both practical and equity considerations, I am supportive of the idea of reduced emphasis on English language ability in college admissions. In an ideal world, college admissions would be based on the aspects of student performance that are least sensitive to parental social class, and most reflective of the student’s own ability and potential. I don’t know enough about the relevant literature to know what subjects those are. But again, I would hazard a guess that differences across schools in the quality of instruction in math, science, and a variety of other subjects are much less extreme than differences in the quality of English language instruction. And I would also suspect that after school tutoring, summer camps, and other expensive activities have much more effect on English language scores than they do on other kinds of scores.

I realize of course that there are all sorts of reasons for criticizing the gaokao, and considering alternatives, but in the meantime, it seems that anything that might reduce the influence of parental socioeconomic status on performance on the gaokao is to be welcomed. I have some ideas about the gaokao, but they are related to some other ideas I have about college admissions in general, in China and the West, and I will leave that for another blog post.


Reflecting on my time as an undergraduate at Caltech 反思我加州理工學院讀本科經歷

Why I wrote this

I recently began to receive reminders about the Caltech Reunion Weekend scheduled for May 15-18, 2014.  I would like to attend, but it would require a special trip back from Hong Kong.  We’ll see.  Combined with my involvement with undergraduate education for the last 17 years or so at UCLA and now at HKUST, and a recent visit to the Caltech Oral History site, these emails triggered a flood of recollections and reflections about my time at Caltech.

Before I get rolling, a note: If you’re considering applying, go talk to the students who are there now. If you’ve already been accepted, just go. I have visited Caltech on a number of occasions in the time since I graduated, and had conversations with a variety of students, and it seems like it offers an even better undergraduate experience than when I was there. The curriculum has been improved, the students are just as committed and talented as ever, there are better housing options, and other changes have taken place as well. Whatever you do, don’t treat anything I have to say below as useful or relevant for your own decision-making, since I am talking about the situation as I remember it nearly three decades ago, and there have been a lot of changes, as far as I can tell mostly positive, since then.

To the extent I have any goal here, it is to counter what I sometimes think is a tendency on the part of alumni, myself included, to compare whatever they hear about Caltech now with an idealized, sentimental and somewhat selective recollection of what it was like when they were there, and then go on to claim that the years during which they attended, whenever they were, really were the glory days. To anyone who wants to claim that the undergraduate experience was much better years or decades ago during some half-remembered golden era that happened to coincide exactly with the four years they attended, I’d like to offer some reminders about some of the problems that existed back then. To the extent there is any critique here, it is self-critique, since I think the problems back in the day were more about the environment in the residential houses than with the education we were provided.

Along these lines, I also wanted to offer a reminiscence and reflection that is a little deeper than the usual anecdote sharing that goes on when alumni gather. Alumni, myself included, nearly all seem to have a repertoire of stories about pranks, over-the-top homework assignments, outlandish incidents during parties, and Feynman encounters. There’s nothing wrong with that since it was indeed a remarkable and unique experience, but I do think that given popular interest in Caltech, it is time to offer more nuanced and thoughtful recollections which focus on the overall experience.

Caltech was and I believe is absolutely unique.  Caltech’s single-minded focus on excellence and the opportunities for involvement in research attracted me in the first place.  It is a small, highly specialized institution with a clear focus on excellence.  I doubt there is anything quite like it, anywhere.  I’m not normally given to hyperbole, but I feel completely comfortable saying that.  I have had a number of opportunities to meet current or recent Caltech undergraduates in the last few years, and everything I hear from them sounds very positive.  Indeed, the environment they describe sounds much better than the one I experienced in terms of the focus on teaching and overall environment. I continue to be pleased that of the elite private universities, Caltech has by far the largest share of students coming from economically less privileged families. I put my money where my mouth is: I send money to the Alumni Fund every year, and encourage my classmates to do so as well.

Why I chose Caltech

My interest in Caltech was cemented at some point in the summer after my junior year.  I was going to high school in a small town in northern Illinois.  My father, a professor at the Illinois Institute of Technology in Chicago, attended a meeting in Las Vegas right after I finished junior year. I went with him. While he was in his meetings, I lounged by the poolside, and occasionally tried my hand at the slot machines. When my father was done with his meetings, we rented a car and went on a long, looping tour that included stops at Harvey Mudd, Caltech, Stanford, Berkeley, and Deep Springs College. A few months after the beginning of my senior year in high school, I had lost interest in other institutions.

By this time, I was also interested in pursuing studies on China, though mainly as a sideline.  My father, who was in computer science, began to have visitors from China in the early 1980s.  Many of them were mid-career professionals who had been sent out to update their skills.  Many of them had suffered terribly during the Cultural Revolution because they were highly educated and/or from families that had been very high status before 1949.  My father also visited Beijing twice as part of academic delegations, I think in 1982 and 1983, and came back telling me that China was a country that I needed to learn more about, because it was on the move.  Based on conversations with my father’s visitors, and my father’s stories about his visits, I became interested in China, and began reading up on my own.  I decided that when I went to college, I wanted to take courses on its history and society.

My brief visit to Caltech convinced me that it was the place for me.  Between talking to the student guide for the campus tour, and an appointment with someone in Admissions, I was sold.  As fate would have it, while I was touring Caltech, the student who led the tour told me that there was a young assistant professor at Caltech named James Lee who was doing quantitative research on Chinese history.  My father advised me that if I really was interested in China, or the humanities and social sciences in general, at a place like Caltech I would probably have the relevant faculty to myself. Everyone else would be trying to attract the attention of faculty in Physics, Engineering and the life sciences.   Conversely, if I went to Berkeley or Stanford, I would probably be one of many eager young undergraduates struggling to attract the attention of humanities and social sciences faculty.

One way or the other, I really wanted a career as an academic and never thought seriously about any other line of work.  My father was a professor and from what I saw, it was the best job in the world, even if it didn’t pay that well. Based on my visit to Caltech and what I read about it, I thought it would be the best place to acquire a training that would lead to graduate school and then a career as a professor, most likely in electrical engineering or computer science.

I applied early decision to Caltech.  I also applied to UC Berkeley, and completed the first stage of an application to Stanford.  I attended some recruitment events organized by MIT and other schools, but they didn’t excite me.  I received my early acceptance from Caltech at some point in December, so never completed my Stanford application, and didn’t apply anywhere else.  I was accepted at Berkeley and offered a very generous scholarship, but by then my mind was set.  It was only later when I became part of the UC system and taught at UCLA that I realized how unusual it was for an out-of-state student to be offered a scholarship like the one that Berkeley offered.

I began the summer after my high school graduation in an absolutely disastrous job delivering pizzas. I quit after only a few weeks. I spent the rest of the summer in a much more satisfying job mowing lawns.  A friend had built up a nice business mowing lawns during the summer, and turned it over to me after going off to college. I liked it because I set my own schedule. I would drive by my customers’ houses in my 1971 Dodge Coronet, and if I thought their lawns needed mowing, I would stop, get the lawnmower out of the trunk, mow their lawn, and leave a bill for them. My customers seemed happy with my service and paid promptly.

Freshman year

Finally, in September 1985, I took the Amtrak train from Chicago to Pasadena so that I could start at Caltech.  At the time, the Amtrak train from Chicago to LA still stopped in Pasadena.  It ran along a right of way that is now used by the Gold Line. I disembarked in Pasadena, and by previous arrangement, a student associated with the Caltech Y picked me up and drove me to Caltech.  Upon arriving at Caltech, I checked in, and was assigned a temporary room in Lloyd House. I stayed in this room while in Rotation, which would determine which of the seven residential houses I would end up in.

Soon after my arrival at Caltech, we all left for Frosh Camp, at Camp Fox on Catalina Island.  My memories of Camp Fox are disjointed, and possibly completely wrong.  I remember some sort of well-intentioned but misdirected effort at education in dating etiquette and sexuality that inexplicably included people dressed as toads. The name of the skit was Love Toads. I also remember Gary Hindoyan of Burger Continental fame at a grill cooking chicken for us, and open-air sleeping pavilions. I don’t remember getting seasick on the way to Catalina Island, which is surprising, since I tend to get seasick easily.

We returned to Caltech for Rotation after Frosh Camp was over.  At Rotation, over the course of one week, we visited each of the seven residential houses and ate dinner in each of them and attended a reception afterwards. I developed a strong preference for Ruddock, Lloyd, and perhaps Blacker.  I didn’t have strong preferences between these three houses, but I did know that I wanted to end up in one of them, and didn’t want to end up in any of the remaining four.  I ended up in Ruddock, along with several other incoming freshmen I had come to know during Rotation.

Once we were in Ruddock, we had some kind of initiation.  I don’t remember many details.  I seem to think it may have included a scavenger hunt or some other fairly innocuous activities intended to introduce us to the older students. I don’t remember anything in Ruddock that would have come anywhere close to hazing.

Initiations in other houses may have been more problematic. While most friends in other houses reported enjoyable experiences that helped integrate them, some did report more disturbing experiences. Again, my information is all 25 years old, so take it with a grain of salt. While apologists for initiation rituals at colleges or other organizations now typically reply that participation is consensual, it isn’t clear to me what ‘consent’ means when you have wide-eyed entering freshmen who are away from home for the first time, and may be scared, lonely or confused, and desperate to impress older students or other classmates. I’ll come back to my concerns about the houses later.

Silly initiation rituals are hardly unique to Caltech. Leaving people of college age isolated from engagement with the world outside is most likely to result in some combination of the Stanford prison experiment and Lord of the Flies.  Unfortunately, the sort of dysfunctional group-think that leads to hazing in immature and socially isolated groups seems to be human nature, as various awful examples of the results of hazing at various institutions and organizations seem to attest. While incidents like the one with the FAMU marching band are especially awful and thankfully rare, every year there are problems at many institutions at which student organizations are allowed to take charge of welcoming new students. Even my current institution doesn’t seem to be immune from problems.

I have many fond memories of freshman year, the best of which involve getting to know my remarkable classmates. I went to middle and high school in Dekalb, Illinois, a small, relatively homogeneous town, and I attended elementary school in Winnipeg, Manitoba, which was similarly homogeneous. At Caltech, I met people from all sorts of backgrounds, with diverse interests. The only things that everyone else had in common were that they were smarter than me, and that they were all interested in the pursuit of knowledge more generally.  As difficult as the classes all were, I never had the sense that we were competing with each other.  Rather, we were all engaged in a common, collective enterprise. There were always classmates willing to lend a dimwit like myself a helping hand, and patiently explain for the Nth time some derivation that I was struggling with.

Many of my best memories of my first two years are somewhat fragmentary: house-organized events like a Secret Santa at the end of fall quarter my freshman year; late night trips to eat at Lucky Boy’s, Tommy’s, and cheap restaurants in Monterey Park and sometimes Little Tokyo; late-night explorations of Los Angeles with classmates or friends who had a car.  My favorite memories from my four years at Caltech are of walks around campus in the evening with a classmate, when a breeze was blowing and the palm trees were swaying. On evenings like that, I felt like I had walked into Steely Dan’s Gaucho or the Eagles’ Hotel California. At the same time, carrying out a synthesis in Chemistry Lab and ending up with a yield greater than 100% was an embarrassment.  I also managed to break expensive glassware in Chemistry lab. I still remember a quiz in Chemistry on which I scored 9 out of 100, followed by my chasing around a classmate who had scored 8 out of 100 waving my quiz paper and yelling “I kicked your ass!”

Academically, freshman year at Caltech was a shock.  I had done well in high school, but compared to my fellow Caltech classmates, I was average, or below average.  I didn’t have any particular problem with being in the middle of the pack. I’d rather be surrounded by people smarter and more creative than myself than dimwitted, unmotivated dullards. While all the classes were challenging, it was frustrating that at least some of them were clearly designed to scare away anyone who wasn’t in the top tail of the distribution, and even more frustrating that the relevant distribution sometimes had little to do with skills that mattered.

The worst offender by far was the required course in electrical engineering, EE 14ab. Note: if you’re thinking about applying to Caltech to study EE, ignore the following rant, and keep in mind that they have reformed the curriculum several times in the nearly 25 years since I graduated, fixing the most problematic aspects and replacing them with more thoughtfully designed components. From talking to recent graduates at alumni gatherings and so forth, the major sounds much, much better designed now.

In retrospect, EE14ab as taught in the late 1980s must have been designed deliberately to drive students away from the major.  It was taught by an adjunct professor early Monday mornings and late Tuesday evenings.  I passed the class, but only barely. I was actually OK with the easier material involving the classic discrete components such as resistors, capacitors, and inductors, mainly because I previously had some introduction to circuits, but when we hit transistors, I was lost. The strange thing about this as a gatekeeping course was that many if not most of the students would never have to characterize a network of discrete components, including transistors, again. Even the later coursework that was in the analog/continuous domain like power electronics or waves and antenna was largely independent of the content of EE 14, and could have been taken without it. And EE 14 had nothing to do with any of the digital/discrete domain courses that we took for the rest of our time. When in another context I came across this manuscript on the role of poorly designed gatekeeping courses in STEM majors in reducing diversity, I thought of EE 14 immediately.

Many classes were much better. I was very impressed with the required physics sequences, even though I didn’t do as well as I would have liked. The freshman and sophomore physics sequences were extremely challenging, and my performance wasn’t much to write home about, but it was obvious that tremendous thought and effort went into them. Physics even had an ombudsman program with representatives from each TA section invited to lunches every month to provide feedback. The instructors and TAs for each quarter attended. I have never seen anything like it since, and quite frankly, can’t imagine doing anything similar at UCLA or my present institution, HKUST. In retrospect it seems amazing that people like Tom Prince, Robert McKeown, and Ricardo Gomez, to name a few, took a required physics sequence so seriously. And my TA my freshman year was actually a professor, Brad Filippone.

Chemistry, math, and applied math were more uneven, at least for me.  The instructors meant well and worked hard, but none of these sequences matched the physics sequence in terms of efforts to solicit feedback and improve teaching. The instructors were doing their best, but as outstanding researchers in their various areas, were probably not the best people to communicate with mere mortals like ourselves. Many of them, like Harry Gray or Sunney Chan, were engaging and entertaining even when I had no idea what they were talking about.

Now that I teach large lecture courses, I’m not going to complain about anyone else’s lecture style. I’m not about to cast any stones. I know how difficult it is to teach a required lecture course that both holds the attention of undergraduates and actually seeks to teach something. In my experience, lectures seem to be entertaining or pedagogically useful, but rarely both. I put many of my own students to sleep. And my own study habits left much to be desired. Perhaps if I had been more diligent, paid attention during lecture, and started problem sets more than a day before they were due, I would have learned more.

By spring of freshman year, I was discouraged about my prospects, and contemplated transferring.  I owe it to Chris Brennen, the Master of Student Houses (MOSH), and Ed Callaway, my RA, that I didn’t.  Spring quarter, I had been talking to my parents about my concerns.  My father was concerned enough that he contacted Chris Brennen, who at the time was our MOSH.  Brennen contacted the RA in Ruddock, Ed Callaway, who came in and talked to me.  One way or the other, I pulled myself together and set aside my thoughts of transferring.

One vivid memory I have is of a remarkable lab class (I think APh 9) where we fabricated integrated chips.  We worked with hydrofluoric (HF) acid, without gloves.  David Rutledge’s philosophy, which I think was correct, was that using gloves would lull us into a false sense of security.  He preferred that we wash our hands and forearms thoroughly and repeatedly.  I was so terrified by the prospect of having HF burn down to the bone in my hands and forearms that I continued washing well after lab was over and I was back in my dorm room. When I woke up the next morning after lab, the first thing I did was check my hands and forearms to confirm that no holes had opened up overnight. Recently I saw some photos of damage caused by use of krokodil, and it was exactly what I had nightmares about when it came to working with HF. At the same time, I was so worried that my own clumsiness would cause me to fail the class that my hands shook when I tried to put chips into the boron furnace, and I kept dropping chips and having to start over.  Finally the TA told me that I was probably going to pass the class no matter what.  Somehow, the knowledge that I couldn’t fail the class lifted a weight from my shoulders. My hands stopped shaking when I was trying to insert the chips into the boron furnace. No longer living in fear that my clumsiness was going to cause me to fail the class, I sailed through the remaining fabrications with ease, and ended up doing quite well.

Sophomore year and beyond

Sophomore year and junior year, my future in the social sciences began to take shape.  I ruled out a career in traditional electrical engineering, mainly because I had no aptitude for work in the analog/continuous-time domain.  I struggled in the Applied Math Analysis course (AMa 95).  I could try to blame the instructors, including Cohen and Wu, but my study habits definitely left something to be desired. Regardless of the reason, I struggled in all of the courses that were continuous-time/analog. The worst was the wave and antenna course in electrical engineering (EE 151).  To this day, based on that experience, I think there is some element of voodoo in wireless, and I am amazed that mobile phones, let alone AM/FM radio, actually work. I think there was another continuous-time/analog class in there somewhere that I muddled through, but I have forgotten its number.  For some reason I think it was EE32ab, on linear systems.  I will come back to that later when I discuss some of my concerns about the environment at Caltech, at least at the time I was there.

I decided that if I was to stay in engineering, I wanted to remain in the discrete-time/digital domain.  I was influenced by some of what I thought were the best taught courses I ever had. Maybe it was just me, but they were all in the discrete/digital domain, not the continuous/analog domain.  This included a really excellent digital signal processing course taught by P.P. Vaidyanathan that I think was EE 112. Even more exciting was a course on information theory (EE/Ma 126) taught by Yaser Abu-Mostafa that I think was probably one of the two best taught courses I ever took, along with Robert McEliece’s course on error-correcting codes (EE/Ma 127ab).  Abu-Mostafa’s and McEliece’s lectures were among the best, most carefully planned, and most elegant I ever heard. To this day, I wish I could lecture like Abu-Mostafa. He didn’t use anything but chalk and a blackboard, but his points were crystal clear. More importantly, the problem sets were absolute models of what homework should be like. They always started with fairly straightforward exercises, then built in a careful and deliberate fashion to very sophisticated problems.

I still remember with pride being one of the first in my house to realize that a proof for an assignment in EE/Ma 126 required two separate proofs, each for a separate but overlapping range of numbers. In essence, a proof that something was true for all numbers required proving that it was true for all numbers less than a, and separately proving that it was true for all numbers greater than b, but b was less than a, so there was a range of numbers between a and b for which the proof was ‘double.’ I had never seen anything like it, but once I realized that the problem was separable in this unusual way, it was easy.

Abu-Mostafa’s course was especially influential on me as a budding social scientist because that is when I realized that any dataset really was just a string of bits, with a finite amount of information determined by its complexity. This led me to conclude that there was only a finite amount of new knowledge that could be extracted from any given dataset, and that anyone who purported to have an amazing new technique for extracting new insights from an existing dataset that had already been worked over and milked dry was probably selling snake oil.  Obviously that wasn’t the goal of the course, and Abu-Mostafa might be surprised that I made such a connection, but in retrospect I think it was important.

More importantly, during sophomore year I finally had the opportunity to pursue work on China.  I took a course with James Lee, at that time an assistant professor.  He was collecting and analyzing household register data from Liaoning. At some point in winter or spring of my sophomore year, he told me that the student who was working with him to help organize the data he was collecting was going to move on and look for an opportunity in a physics lab.  Lee had accumulated population register data that at the time was in rectangular flat files and being analyzed with programs written in C, and he was looking for someone to manage the data.  I had some experience in database management, and jumped at the opportunity.  I rewrote all of the code that was being used to manage the data in dBase III+, and never looked back.  The summer after sophomore year, I had a SURF, and joined Lee when he went to Beijing and Shenyang for research. That was my first time outside North America, and is worth a blog post in its own right.

By senior year, I made up my mind to pursue graduate training in the social sciences.  I ended up not majoring in Electrical Engineering because I didn’t take the power electronics course (I think EE 40) and didn’t complete the senior project course (I think EE 91abc). Rather, I ended up with a degree in Engineering and Applied Sciences, and a second degree in History. I applied to and was accepted at sociology and/or demography programs at a variety of schools.  I decided on Penn.  I did apply to graduate schools in electrical engineering and computer science, and was accepted at Columbia.  I still remember the silence at the other end of the line when I told the Electrical Engineering professor at Columbia who called me to offer admission and a fellowship that I had decided to pursue a PhD in sociology and demography.  Because of opportunities that were available to me through Caltech, in particular a Durfee and a Watson Fellowship, I received support to spend the year after graduation in Taiwan and then Beijing studying Chinese, before I showed up in Philadelphia for graduate school.


The good

Looking back, I’m glad I went to Caltech, and I don’t have anything to complain about in terms of how I was treated. It is impossible to imagine that I could have ended up where I am now except as a result of being at Caltech.  Much of this, of course, was luck.  At Caltech I happened to run into one of probably a handful of historians working on China who were doing work where I had a comparative advantage.  If I had approached a more traditional historian of China at some other institution, I probably would have been ignored, or would have been competing with dozens of other bright young things.

The willingness of faculty and even graduate students to involve themselves in undergraduate student life was one of the most remarkable features of Caltech.  At what other institution would such a distinguished and productive faculty member as Chris Brennen take the time to oversee matters related to undergraduates, and personally intervene in response to a phone call from the parent of a freshman like myself?  Nowadays, at almost all colleges, all of the sorts of things that the Master of Student Houses was responsible for are delegated to full-time, non-academic staff. There might be a faculty committee somewhere setting general policy regarding undergraduate life, but certainly no faculty member intervenes in the cases of individual students. And where else would you have a graduate student of the caliber of Ed Callaway serving as an RA in an undergraduate dormitory and helping out in a situation like this?  Ed was a fantastic RA, and it has been wonderful to see him go on to such a distinguished career. Having spent all of my time after Caltech at large institutions like Penn and UCLA where incoming freshmen are processed like the livestock at a factory farm, not handled on a case-by-case basis, the fact that people like Brennen and Callaway intervened in my case seems unimaginable. Sunney Chan was also much loved by many of us for his clear commitment to undergraduate life.

I look back on some house activities with pride, in particular the geeky humor and various good-natured collective efforts, which overall were probably much more common than the behavior I am concerned about that I will discuss in a moment. Much of our humor, including our puns, was based on the names of mathematical or physical constants, the names of prominent scientists or engineers, or the names of theorems. Beyond Ditch Day, we also engaged in many fun and challenging projects on the rare occasions when we had too much time on our hands, like painting murals. I was especially proud of my role repainting one of the walls in Ruddock with the cover of Pink Floyd’s Dark Side of the Moon one winter break when some classmates and I stayed through Christmas and New Year, and was sorely disappointed a few years later to find out it had been painted over. I was also really impressed with classmates who organized complex, large-scale pranks, like changing the Hollywood sign to read Caltech. That took place while I was there, but I wasn’t involved.

My positive experience reflects several of the unique features of Caltech: the accessibility of outstanding faculty, the remarkable opportunities for undergraduates, and the opportunity to be with fellow students who were from widely varied backgrounds and all very smart and committed.  There are few institutions where undergraduates have such easy access to faculty. Many of my classmates had the opportunity to work in the labs of distinguished researchers, and take classes with them.  In the humanities and social sciences, we had smaller seminar classes with outstanding faculty.  I still have fond memories of small classes with Jim Woodward in Philosophy, Phil Hoffman and many others in History, and Bruce Cain, John Ledyard and Rod Kiewiet in political science. And I had Jean-Laurent Rosenthal as a TA.  And once I was working with Lee, the SURF program and other opportunities available through the Institute allowed me to continue working on research on China during the summer, and pursue studies in Chinese after I graduated.

The shortcomings

As time has gone on, however, I have thought more about how the experience back in the late 1980s could have been better, not for me, but for many others.  Others did not have as positive an experience as I did.  Some transferred out, or else finished but regretted their decision to attend. My subsequent involvement in undergraduate education and observations of student life at other institutions lead me to reflect back and think about what the shortcomings were, and what could have been improved, if not for me, then for others.

Looking back, the least distinctive and most problematic feature of Caltech for me was one that many alumni are most enthusiastic about: the seven residential houses. Many of us, myself included, have fond memories of the houses, especially the friendships we formed there, but we need to acknowledge that there were serious problems. There were many positive aspects of the houses, most notably the mutual support and the camaraderie, but there were negative features as well, and we should acknowledge them and hope that they have been remedied, or will be remedied. To help get the ball rolling, I am going to share my own concerns and regrets.

Much of the activity in the houses was silly but not especially harmful. There was a lot of immature but largely innocent nonsense like launching butter pats at the ceiling and throwing buns at each other and dragging people into the shower for playing Wagner’s Ride of the Valkyries. While I’m not particularly proud of my involvement in such antics, by the standards of college student behavior, especially in contemporary fraternities, such escapades were relatively tame. One way or the other, improving the house environment doesn’t mean that we all should have been sitting in the lounge in smoking jackets, sipping tea and listening to chamber music or thumbing through back issues of the New York Review of Books.

For too many classmates, however, the environment in the houses was problematic, unpleasant, or even hostile. The central problem was that even though we had high standards for academic conduct, we set low standards for our personal behavior, especially in group settings. When it came to academics, we adhered to an Honor Code that set very high standards for our classwork and fostered mutual trust. When it came to social interactions, however, we didn’t have any common standards at all. This was especially the case when we were acting as part of a group, not as individuals.

The houses at the time I was there in the late 1980s were about what would be expected from throwing a bunch of confused, anxious teenagers together in a dorm without any supervision. Students at Caltech may have been better in science and engineering than students at other colleges, but this doesn’t mean that they were wiser, more mature, or otherwise better equipped to manage their own affairs. There simply wasn’t much in the way that the houses organized themselves or set standards for behavior that was commensurate with the high academic standards we set ourselves. While it may not be realistic to expect too much from any bunch of adolescents who were left to themselves, what I have seen of undergraduates in other environments convinces me we could have done much better.

The environment in the houses in the late 1980s certainly discouraged good study habits. House life was a dashpot of energy and focus. Given the choice between participating in whatever happened to be going on in the house, or working on an assignment, many of us chose the former. There was always something going on to divert us when we should have been focused on problem sets or preparation for an exam. Many of us, including myself, screwed around when we should have been studying and then completed our assignments or studied for exams in a panicked frenzy at the last moment. Indeed, all-nighters were as much an immature display of machismo as a rational strategy for completing the task at hand. Looking back, it is pretty clear that classmates who maintained normal sleep schedules and organized their work appropriately actually seemed to do fine. The students who did the best, and have had the most outstanding careers since then, were mostly the ones who had the wisdom and maturity to limit their engagement in house activities, or disengage completely, perhaps by moving off campus.

The insularity and self-absorption of the houses at that time rewarded and encouraged inappropriate or immature behavior. In particular, even though most of us exercised sound judgment in one-on-one interactions, now that I look back, judgment often went out the window in group settings. The environment wasn’t Lord of the Flies, or even the Bollinger Club in Scone College that Paul Pennyfeather encountered in Evelyn Waugh’s Decline and Fall, but it could be inappropriate, and it didn’t have to be that way. Perhaps it was different in other houses, and it has certainly improved rather dramatically since then.

I don’t remember that there was ever any individual or collective effort to discourage childish, boorish, offensive or inappropriate behavior, whether on the part of individuals, or groups. If anything, bad behavior was rewarded with attention. Certainly, when my own behavior crossed the line, there was no check from anyone within the house. During one especially regrettable period during the fall of my junior year that I feel sick about to this day, when I treated a classmate very badly for no reason except that I was immature and self-centered, it was only when my classmate confronted me and told me to stop that I came to my senses and realized how far I had fallen.

Tying this back to my discussion of the house environment, there was no check to my behavior within the house at that time. At the time I was so immature and self-centered I may well have ignored any such checks, so I am not suggesting that my behavior was the fault of anyone but myself. Rather, looking back, I am concerned that there were no checks to my behavior, and that at some level, it seems to have been taken for granted. Others who behaved badly were rarely if ever checked. Sensible classmates who might have been a voice of reason and a model to others were mostly wise enough to move off campus or if they stayed in the house, disengage.

The tolerance or in some cases celebration of boorish behavior created an unpleasant climate for many, most clearly for our female classmates. In retrospect, I am amazed that more classmates, especially our female classmates, didn’t move off campus or transfer out. We spent a great deal of time sitting around and whining about the imbalanced sex ratio and wishing that the Caltech administration or someone other than ourselves would do something about it (this classic scene from the movie Say Anything always comes to mind), but I don’t think we ever had enough self-awareness to realize that we were the problem, and weren’t exactly doing much to make the environment a welcoming one. Reading this recent piece in the New York Times on challenges to women pursuing careers in science reminded me of our role in creating an unpleasant environment. That few people complained openly is, of course, not evidence that there was no problem. Why would anyone complain or seek change when it was so much easier to move off campus? Indeed, even though we took it for granted when classmates moved off campus, and sympathized, I don’t remember that we ever sat back and wondered what we were doing to produce a climate that was so unappealing.

The tendency for many of the most focused, thoughtful and overall best put-together classmates to move off-campus or stay on campus but disengage made the pathologies in the houses self-reinforcing. As time went on, the students who behaved badly, or didn’t mind bad behavior, accounted for a larger and larger share of the students who remained in the houses, where as juniors and seniors they set the tone for the incoming classes. The fact that so many classmates moved off campus should have been a signal to us that there was something wrong. We should have been asking ourselves what we were doing wrong that was making the environment unappealing to them.

Though we often liked to excuse our individual or collective behavior as a response to the stress we experienced, claiming that it was a way of letting off steam, that was an awful excuse. As I just noted, at least some of our stress was the product of our own bad time management. And in many ways, whatever we thought we experienced pales in comparison to what I saw some undergraduates at UCLA deal with. Real stress is being a first-generation college student taking a full course load, working full-time to pay tuition and fees, serving as an interpreter for parents or other members of their extended family, and still finding time to volunteer. I encountered many outstanding students at UCLA whose family circumstances and personal situations were so challenging and complex that I simply could not imagine being in their shoes and surviving.

We also gave ourselves a pass on the house environment because rather than complain, classmates who were unhappy were free to move off campus, or remain in the house but disengage. Like many social groups in which boorish behavior becomes ingrained, we treated the absence of complaint or objection as evidence of consent. I don’t know if it ever occurred to any of us that just because no one who left or disengaged complained openly, it didn’t mean they were all happy with what was happening.

Another rationalization that we sometimes offered for our behavior, that the rules that governed others didn’t apply to us because we were smart and different, is downright troubling. It is true that we were highly selected for our potential to excel in science or engineering. That doesn’t mean we were automatically wiser, more mature, or somehow better in general than everyone else. Given the environment that developed in the houses, it certainly didn’t mean that we were better equipped to be left to ourselves to manage our affairs. And to the extent that we were somehow elite, that should have led us to set higher standards for our behavior, not exempt ourselves.

As proud as we were of ourselves, I don’t remember any house events in which we sought to do something for the community. Houses organized parties, field trips, and other events, but I don’t remember any of them organizing anything like volunteering. Many students did volunteer in various contexts, but they did so individually or as members of organizations like the Y. I don’t remember the houses ever doing anything to promote such engagement with the community. The fact that we were busy with schoolwork is not an excuse. At UCLA, I encountered remarkable undergraduates who were not only full-time students, but were working to put themselves through school, and also found the time to volunteer or otherwise do something for the community.

The administration recognized the problems with the houses and tried to address them on a regular basis, but as far as I know, every time they tried to do something about the really boorish behavior, there was a backlash from at least some reactionary students and alumni. It may be that the administration’s efforts were sometimes a bit ham-handed. That said, it doesn’t reflect well on students and alumni that none of these efforts triggered any serious self-reflection, or any acknowledgment that there indeed might be a real problem. I should add that as far as I know from talking to recent graduates, the situation is much, much better now. The environment sounds much more healthy and supportive, I believe partly as a result of sustained efforts on the part of the administration, but probably also because the students have changed.

I sometimes wonder if the way we applied the Honor Code was part of our problem. The Honor Code dictated that we should not take unfair advantage of others. The interpretation was fairly clear when it came to classwork: it forbade anything that smacked of cheating. But looking back, I don’t think we ever asked whether the Honor Code should also apply to the way we treated each other outside of class, in house activities, or in interpersonal relationships. Specifically, as far as I know, irresponsible, immature, mean-spirited, and hurtful behavior was not considered to be covered by the Honor Code. This might have been acceptable if we acted collectively albeit informally to promote appropriate behavior and discourage inappropriate behavior, but we didn’t.

We should have had a broader interpretation of the Honor Code that went beyond avoiding taking unfair advantage of other students in classwork, to emphasize thinking about each other’s feelings, and recognizing that people may have been deeply unhappy even if they weren’t coming out and saying so. Rather than taking it for granted that because we were smart, we knew what we were doing and could handle the situation, we should have asked ourselves whether drinking competitions, initiations for new students, or other practices that were reckless and in some cases dangerous were really commensurate with the high academic standards we set for ourselves.

Periodically, the house system comes up as a topic of discussion, and my experiences 25 years ago make me wish that the discussion was more evidence-based and less anecdote-based. What is striking about the occasional discussions I see about the house system is the passion with which alumni who think of themselves as committed to reason and logic will resort to anecdote, assertion, and analogy to come up with all sorts of imagined benefits of the house system as it existed decades ago. I certainly would like to see more evidence: statistics on graduation rates, percent moving off campus, percent going on to graduate programs, starting salaries for those who don’t go on to graduate school, and various measures for outcomes 10 years out by house. Annual, anonymous surveys of students by house would be useful. Tabulate everything and provide the information to incoming freshmen, donors, and anyone else so they can make their own decisions.

I would also like to see a broad and representative cross-section of alumni discuss their experiences. Caltech has a hold on the popular imagination as a result of being featured in movies and TV shows. It would be great to see a real discussion of experiences, not only from the people who made it through Caltech and remember the houses fondly, but also from the people who weren’t so enamored of their time in the houses. We should be hearing from the people who moved off campus or transferred out, not just listening to the people who stayed in the houses or one of the house-affiliated off-campus facilities for all four years. I’d like to see recollections from a cross-section of alumni who will not only tell stories, but reflect on their experiences and think not only about what was special about the houses, but what could have been improved.

Now that my current institution is discussing the creation of residential houses, and already has some residence halls that have identities of their own, I keep wondering whether there is a way to preserve the unique and positive aspects of residential houses, not just at Caltech but at any institution. Residential houses like we had at Caltech have lots of upside potential. The goal would be to keep the camaraderie and mutual support, the silly and geeky humor, and the commitment to inquiry and knowledge, while also encouraging more engagement with the world outside, and promoting introspection and self-reflection. While most of the humor and collective activity in the houses in the late 1980s was good-natured, and only a portion of it was offensive or inappropriate, it should be possible to create an even better environment where none of the behavior is offensive or inappropriate, and everyone always feels welcome. The question is not whether the house environment in the late 1980s was harmless, or was somehow excused by the stress we experienced or our self-proclaimed elite status, but whether the house environment could have been better, more welcoming, and more conducive to our personal development.

You may very well ask why I have written such a long-winded reminiscence, along with a rambling critique of an environment in the undergraduate houses that has likely changed dramatically in the last 25 years. Not only do I want to exorcise lingering regrets about my own failure to recognize problems at the time and behave better, but I keep thinking that the problems in the houses as they were 25 years ago encapsulate some of the issues that are in the news today. Basically, I think the houses are about what you would expect anywhere if you put a bunch of roughly like-minded teenagers or young adults together and left them to themselves. Maybe not Lord of the Flies, If, or Scone College in Decline and Fall, but problematic nonetheless. The contemporary relevance is not just the seemingly unending string of scandals at fraternities, which I think reflect systemic problems that arise whenever a bunch of young people are left to themselves to form their own, insular society, but also other examples where insular subcultures such as the one in some tech companies create an unwelcoming environment.

As fortunate as I feel I was to have had the opportunity to attend Caltech in the late 1980s, I can’t help but think of ways the experience might have been better, if not for me, then for others.  As much as I liked my experience, my involvement with undergraduate education at other institutions, including 16 years at UCLA, and a few months at HKUST, leads me to look back and think about how it could have been improved. I have no complaints about my time at Caltech, and indeed I owe almost everything I am right now to the unique opportunities that were available when I was there.  Things have changed since I was there, mostly for the better, and from talking to students and recent graduates, it is pretty apparent to me that the undergraduate experience is even better than it was.

Overall, I am glad that I attended Caltech. To paraphrase the lyrics of “Ride Captain Ride” by The Blues Image, I am amazed at the friends I had there on that trip. But I think we could have made the environment in the houses in the 1980s more welcoming and inclusive.

Presentations related to East Asian historical demography at IUSSP 2013 Busan

I’m trying to put together a list of sessions that include presentations focused on East Asian historical demography at the IUSSP meetings in Busan, South Korea, August 26 to 31, 2013.

Below is what I have found so far, copied and pasted from the IUSSP online programme.  If the session is focused on East Asia, I have copied information for the entire session.  In other cases where a paper focused on East Asia appears in a session with a broader theme, I only copied over the information about the East Asia themed paper.

I have probably missed many presentations because I was searching on the names of people who I already knew were presenting.  If you know of any other presentations focused on historical demography in East Asia, please email me and I will add them.  Please email me a link to the session (see below for examples) so I can copy and paste the information easily.


Session 186:
Historical demography of East Asia from household registers

Thursday, August 29th 2013
13:30 – 15:00
Room 108, Convention Hall, 1st Floor

Chair: Cameron Campbell, UCLA
Discussant: Zhongwei Zhao, Australian National University

  1. Age patterns of migration among Korean adults in early 20th-century Seoul  •  Bongoh Kye, Kookmin University; Heejin Park, Kyungpook National University
  2. Demographic Responses to Economic Stress and Household Context in Three Northeastern Japanese Villages 1708-1870  •  Noriko Tsuya, Keio University; Satomi Kurosu, Reitaku University
  3. Household Context and Individual Departure: The Case of ‘Escape’ in Three ‘Unfree’ East Asian Populations, 1700-1900  •  Hao Dong, Hong Kong University of Science and Technology; Satomi Kurosu, Reitaku University; James Lee, Hong Kong University of Science and Technology
  4. Marriage, household formation and social mobility in colonial Taiwan: A new occupational database for Taiwanese family history  •  Wen-shan Yang, Academia Sinica; Xingchen C.C. Lin, Institute of European and American Studies, Academia Sinica

Session 264:
Early life stress and later health

Friday, August 30th 2013
15:30 – 17:00
Room 102, Convention Hall, 1st Floor
Chair: Tommy Bengtsson, Lund University
Discussant: Alain Gagnon, Université de Montréal

Session 200:
EurAsian history of population and family

Thursday, August 29th 2013
15:30 – 17:00
Room 107, Convention Hall, 1st Floor
Chair: Diego Ramiro Fariñas, IEGD-CCHS Spanish National Research Council (CSIC)
Discussant: Jérôme Bourdieu, INRA-PSE and EHESS

  1. Mortality and living standards in Asia and Europe, 1700-1900  •  Tommy Bengtsson, Lund University; James Lee, Hong Kong University of Science and Technology; Cameron Campbell, UCLA
  2. Migrations in the Adjustment between Population and Resources. Eurasian Contributions  •  Michel Oris, Université de Genève; Martin Dribe, Lund University; Marco Breschi, University of Sassari
  3. Prudence and Pressure: Reproduction and Human Agency in Europe and Asia, 1700-1900  •  Noriko Tsuya, Keio University; Feng Wang, Brookings-Tsinghua Center for Public Policy; George Alter, University of Michigan; James Lee, Hong Kong University of Science and Technology
  4. Similarity in difference in pre-industrial Eurasian marriage  •  Christer Lundh, University of Gothenburg; Satomi Kurosu, Reitaku University

Session 270:
Urbanisation, economic development and family transformation through history

Friday, August 30th 2013
15:30 – 17:00
Room 108, Convention Hall, 1st Floor
Chair: Lionel Kesztenbaum, Institut National d’Études Démographiques (INED)
Discussant: Jérôme Bourdieu, INRA-PSE and EHESS

The future of marriage in China

Reading Leta Hong Fincher’s CNN piece on changes in women’s attitudes about marriage in China reminded me of a prediction that I have been making for the past two or three years to anyone who will listen:

Within a decade, marriage patterns in mainland China will resemble those everywhere else in East Asia, with high proportions of women marrying late or not at all. Similarly, high proportions of men, especially poorly educated ones with poor economic prospects, will be unable to marry. This is already happening in Beijing, Shanghai, and other prosperous cities. Based on what happened in Taiwan, South Korea, and Japan after 1990 or so, I am guessing the changes, when they occur, will be sudden and dramatic. These changes will be much larger and more important than any of the ones associated with imbalanced sex ratios at birth, and would occur even if the sex ratio at birth were normal.  More speculatively, I expect that mainland China will continue to resemble other East Asian societies in terms of having very low rates of non-marital childbearing. As proportions married collapse, the fertility rate will fall even further.

When I look at what is happening in mainland China right now, and what has happened elsewhere in East Asia, this all seems obvious.  All of the factors that seemed to be associated with rapid marriage change elsewhere in East Asia seem to be present in mainland China right now: dramatic and rapid economic and social change, rising levels of female education, changing patterns of inter-generational relations,  and changing expectations about career and marriage on the part of both young men and women.

One piece of indirect evidence suggests that there is pent-up demand for, or at least curiosity about, the possibilities associated with delaying marriage, at least for women: according to Joy Chen’s website, the Chinese version of her straightforwardly titled book Do Not Marry Before Age 30 seems to be selling well.   I haven’t read the book and probably never will since I am not part of the target audience, but it is refreshing to see someone writing a book that is the exact opposite of the usual nonsense offering women advice on how to bag a man, on how to avoid spinsterhood, and so forth.

Nevertheless, many observers, Chinese and foreign, seem wedded in some vague way to a notion that ‘tradition’ will somehow prevent the same changes taking place in China that took place elsewhere in East Asia.  ‘Tradition’ and ‘cultural values’ did not serve as a bulwark against marriage change elsewhere in East Asia in the last two decades, so I don’t understand why they would prevent change in China now.  Indeed they have not done much to prevent changes in marriage patterns among young adults in China’s largest and most developed cities, notably Beijing and Shanghai, where the average age at marriage is already high, and the proportions of people marrying are falling.  ‘Tradition’ and ‘culture’ may help us understand why specific phenomena persist to the present, but they have a terrible track record as predictors of future behavior.  Sometimes this assumption of continuity is explicit, but in many cases it is implicit, for example, in the assumptions about marriage preferences that demographers simulating the effects of sex ratio imbalances build into their projection models.

The best example of how useless tradition is as a predictor of future trends is probably the recent rise in divorce rates in China.  Rates of divorce in China used to be very low.  Most people, including myself, assumed that they would remain low, because of ‘culture’ or ‘tradition’ that encouraged unhappy couples to remain married.  Yet when China changed divorce laws around a decade ago to make it easier to divorce, rates skyrocketed.  Low divorce rates apparently had more to do with institutional and legal barriers than with any ‘culture’ or ‘tradition’ that discouraged divorce.  Rapid increases in divorce rates elsewhere in East Asia over the last two decades were similarly unexpected.

Somewhat perplexing for me is the continuing concern on the part of pundits and academics about a topic that for me is not much more than a side issue: the potential effects on marriage of imbalanced sex ratios at birth.  This is not to dismiss concern about imbalanced sex ratios at birth.  There are many important reasons to be concerned about imbalanced sex ratios at birth, not the least of which is what they reflect about gender attitudes.  However, I think the effects of imbalanced sex ratios on marriage patterns will turn out to be fairly small because the affected cohorts will be coming of age at a time when much more dramatic shifts in marriage patterns are occurring.  No matter what the sex ratio of births is or was, the number of men and women not marrying is probably going to increase dramatically.  While some of the men who do not marry might be unmarried because of the imbalanced sex ratio, many more will be unmarried because none of the single women are willing to marry them, or they themselves choose not to marry.

As to the implications of what I think will be a very rapid shift in marriage patterns in mainland China, I can only speculate.  It certainly won’t be a disaster.  Other places in East Asia seem to have experienced these rapid shifts in the last decade or two without collapsing.  I would guess that twenty-somethings in China will spend more and more of their time working, spending time with friends, and pursuing individual interests, and less and less time meeting and assessing potential spouses.  And I suspect that as elsewhere in East Asia, members of senior generations will finally realize the world has changed, and stop pressuring their adult children, nephews, and nieces to find a spouse and have children.  As I noted earlier, in light of the very low levels of nonmarital childbearing in China, the most important effect of delayed or foregone marriage there may be further reductions in the birth rate.

I would certainly like to see commentators, journalists, pundits, academics, and policymakers acknowledge the possibility that marriage may change rapidly.  At the very minimum, demographers should allow for a wider range of possibilities for marriage preferences when they run projections to examine possible impacts of imbalanced sex ratios.  If we’re lucky, the degrading and artificial term ‘sheng nv’ will be banished from the language, and will no longer be used either by domestic commentators, or foreign journalists who uncritically accept the term as an organic one and reuse it, even though it was actually coined and put into widespread use as part of a systematic effort to belittle unmarried women.  Best of all would be accommodation on the part of the government, commentators and senior generations to the changing reality, and abandonment of efforts to pressure young people, especially women, into marrying by a certain age.

Academics and policymakers need to engage in a thoughtful and open-minded assessment of why marriage is changing that goes beyond repeating tired and sometimes offensive platitudes, especially ones about young women having expectations that are too high, or young people in general being too selfish, irresponsible, and consumption-oriented.  The former is especially unappealing because implicitly, it argues that women should be the ones who make sacrifices in order to marry, not men.

Serious consideration needs to be given to the fact that marriage may be unappealing to women because labor markets and household gender roles combine to make the prospect of being a working mother especially unappealing.  In China, as in many societies, women are responsible for many domestic duties including child care and elder care, even if they are also working.  The financial burden associated with buying a home and paying for a child’s education, meanwhile, makes staying at home unrealistic as an option.  Given a choice between remaining single and working, or being married and working and doing most of the domestic work, remaining single seems an eminently sensible option.


More data doesn’t automatically lead to deeper understanding…

Finally, someone has very publicly thrown cold water on the wild claims made for the potential of ‘big data’. I like the title: “Why Big Data is Not Truth.”

It seems like every week now, I hear or read about someone in the news, typically an engineer or a computer scientist but very rarely a social scientist, breathlessly extolling the potential of ‘big data’ to yield transformative insights into social phenomena or individual behavior.  Almost inevitably this is illustrated with an utterly banal example of a finding, usually fit for nothing more than a cocktail party conversation, like perhaps people with small heads (as inferred from the sizes of the hats they buy) consume unusually large numbers of mangoes on Tuesdays and Thursdays.  That is a made up example, but to me is representative of the sorts of trivial and atheoretical ‘findings’ that too often are hauled out in puff pieces about the golden world of opportunity offered by big data.  The banality of these ‘findings’ illustrates the fundamental challenge that we face when we seek insight into underlying processes or mechanisms from observational data on people: describing a relationship is not the same as understanding it, or explaining it.

Correlation is not causality, and the problem doesn’t disappear no matter how much data we throw at it.  Whether a dataset contains one thousand records with one hundred variables, or one trillion records with one million variables, if it is observational data collected ‘in the wild’ or via a survey, any association observed in it is still just an empirical finding, albeit a potentially important one, until it is replicated in different settings with different data, and has a credible explanation.  A larger dataset or more variables don’t magically compensate for the fact that the data is based on observation, as opposed to generated by a controlled experiment with random assignment to treatment and control group.  If we’re lucky, there may be something in the data that can be thought of as an exogenous shock experienced by a random subset of the subjects, in which case differences between subjects who experienced the shock and those who didn’t may be interpreted as a genuine effect of the shock.

Lest anyone accuse me of being prejudiced against large datasets with many variables, let me be the first to say that some of my best friends are large datasets.  Indeed, for the last twenty years, I have helped create large historical datasets, analyze them, and release them to the public in the hope that others will be able to find applications for them that I could never imagine.  We have created datasets that record people who lived in China in the eighteenth and nineteenth centuries from birth to death, recording at regular intervals their social and economic status, their household and community context, and their demographic behavior and socioeconomic attainment.  I will probably continue helping to compile and analyze such datasets for the rest of my career, because that is how I roll, and because no one has shown up at my doorstep with a suitcase full of cash that would be mine if only I would join them on some sort of outlandish caper like you would expect in a Ross Thomas novel.

It is this experience with large datasets that has made me wary of the more extravagant claims for big data.  My collaborators and I have learned a great deal about life in the past in China, and about demographic behavior in general, from careful analysis of these data.  I want to continue compiling, analyzing and releasing these and other data.  I am sure that others who work with the data we have publicly released will make even more spectacular and important discoveries, not just about China, but about human populations more generally.

All the effort we have expended in the construction and analysis of these large datasets has made me painfully aware of what it is realistic to hope for.  We can describe important empirical regularities in great detail.  Many of these are of considerable interest in their own right, even if we can only suggest possible explanations for them, because they illuminate life in another time.  They are worth publishing in the same way that some fascinating but inexplicable astronomical phenomenon is worth publishing.

For some findings, an explanation is fairly straightforward and very credible.  We find that married women who had not yet borne children for their husbands, or had borne only daughters, had higher death rates than women who had borne sons.  This makes sense, since in the past in China, the primary responsibility of married women was to bear and then raise an heir for their husband’s family, and until they had at least borne a son, they were probably on a sort of probationary status, with limited access to family resources.  Once they had borne a son, they were probably fully enfranchised members of their husband’s household.  And we find that death rates rose and birth rates fell when grain prices were high, presumably because of economic adversity.

If we’re lucky, we find something that may have some relevance for the contemporary era.  For example, we found that babies born soon after their elder siblings (within 24 months) had elevated death rates in old age.  We speculated that this reflected the effects of maternal depletion on the newborn.  Linked to contemporary results on apparently adverse short-term consequences of a short preceding birth interval, perhaps this might tell us something important about human physiology.

But we also find perplexing results that are robust to alternative specifications and persist no matter what subset of the data we look at, but that we can’t explain.  We find that high status males actually had higher death rates than other males.  We don’t know why, and can only speculate.  Perhaps their status and wealth allowed them to make what our son’s elementary school refers to as ‘bad choices’: maybe they squandered their money on debauchery in Shenyang (at the time, Fengtian) and died early as a result of liver failure or tertiary syphilis.  We just don’t know.

More relevant to my rant, we periodically observe statistically significant associations, some of them quite fascinating, that disappear when we expand the dataset, or use a different subset of the data, or make slight modifications to our model.  If I had a dime for every association like this that we had come across, I’d be a rich man.  I suppose that if the result were interesting, we could come up with some post hoc rationalization of why it only appears in a specific subset of the dataset, when the model is specified in a very particular way, and try to publish it, but that sort of thing makes us queasy, because of our awareness that if you measure enough associations, the phenomenon of mass significance will lead at least some of them to appear to be significant, even if they aren’t.  Again, we feel more comfortable making a claim if a result appears under multiple alternative specifications of the model, and across different subsets of the dataset.

I’m happy to continue plugging away with this sort of analysis indefinitely because I feel like an astronomer, except that instead of peering through a telescope at distant stars or galaxies and then trying to work backwards to develop an explanation for the regularities I observe, I am observing people in the past who I will never meet (unless I can buy a Tardis on Craigslist from a dissipated Time Lord whose alimony, child support, gambling debts and coke habit have made him desperate for money) and trying to discern and provide explanations for the regularities that I observe.  Some of the explanations or interpretations I come up with may be overturned as people uncover even better data or apply better methods, but I am pleased to have made some incremental contribution to our understanding of life in the past.

If the starry-eyed proselytizers of the salvation to be delivered to us by collection and analysis of ‘big data’ were willing to put down their Kool-Aid for a moment and limit themselves to a more cautious prediction that large quantities of data will allow us to observe empirical regularities and every once in a while come up with some genuine insight about the determinants of specific behaviors, I would be happy.  But too often, ‘big data’ proselytizers seem to imagine a future like the one in Isaac Asimov’s Foundation trilogy which I enjoyed so much in middle school, where simply by sifting through enough data, it is possible to predict not only individual behavior, but social change, decades or centuries in advance.  To put it mildly, they’re getting somewhat ahead of the field in terms of the optimism about the possibility to go from observation of individuals to predictions about their behavior.

To me, the biggest challenge to the use of ‘big data’ is some version of the phenomenon of ‘mass significance,’ which I referred to earlier in the context of our own experience.  If you have hundreds or thousands of variables that in reality have nothing to do with each other, and in fact are all series of random numbers generated by die rolls or some other process, but you calculate pairwise correlations between them, inevitably by luck of the draw some percentage of them will appear to have an association that is statistically significant at some threshold.  But if you collect the same data again in another time period, a completely different set of variables may be associated with each other.  In other words, what appears to be a statistically significant association in data collected in one time period may show no association at all in a second time period.   Companies that find that people whose last names end in Y or who like to fill their cars with gas on Wednesdays also tend to be especially receptive to offered discounts on artichokes in one time period, may be disappointed in the next time period when they offer special deals on artichokes to such people.
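
A toy simulation makes the point concrete. The sketch below is in Python and uses only the standard library. It generates variables that are pure random noise, tests every pairwise correlation at roughly the 5% level, and then repeats the whole exercise with a fresh round of 'data collection'. The number of variables (40), the number of observations (200), the Gaussian noise, the fixed seed, and the Fisher z approximation used for the significance test are all illustrative assumptions, not anything taken from a real analysis.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def significant_pairs(variables, z_crit=1.96):
    """Pairs whose correlation passes a two-sided test at about the 5% level.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly
    standard normal when the two variables are truly unrelated.
    """
    n_obs = len(variables[0])
    hits = set()
    for i, j in combinations(range(len(variables)), 2):
        r = pearson_r(variables[i], variables[j])
        z = math.atanh(r) * math.sqrt(n_obs - 3)
        if abs(z) > z_crit:
            hits.add((i, j))
    return hits


def collect_period(rng, n_vars=40, n_obs=200):
    """One round of 'data collection': every variable is independent noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]


rng = random.Random(2013)  # fixed seed (arbitrary) so the run is repeatable
n_pairs = 40 * 39 // 2     # number of distinct pairs among 40 variables
period1 = significant_pairs(collect_period(rng))
period2 = significant_pairs(collect_period(rng))
replicated = period1 & period2

print(f"pairs tested: {n_pairs}")
print(f"'significant' in period 1: {len(period1)}")
print(f"'significant' in period 2: {len(period2)}")
print(f"'significant' in both periods: {len(replicated)}")
```

With 780 pairs and a 5% threshold, something on the order of 40 pairs should look 'significant' in each period even though nothing is related to anything, and only a handful of those should turn up again the second time around. Requiring an association to recur in independently collected data is a cheap and effective filter.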

Another problem, well known from previous analysis of observational data, is the possibility that observed relationships are not causal, but reflect complex influences of other variables that we don’t observe.  These might be variables that affect the chances of particular types of people being observed in our data, or variables that affect the values of the variables that we do observe.  Whether spurious relationships observed in data are the result of selection biases or the influence of an unobserved variable on the variables that we do observe, such relationships are not causal, and changing behavior or making policy based on them may be premature, to say the least.  And in spite of the claims made for various approaches, I don’t think there is any statistical voodoo that fixes the situation, and allows anyone to make solid claims of causality from purely observational data, except in very limited situations where at least one of the variables appears to be genuinely exogenous, in which case instrumental variables or other approaches may offer some insight.
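
The influence of an unobserved variable is also easy to sketch in Python. In the hypothetical setup below, x and y never affect each other; each is simply a hidden variable z plus independent noise. The variable names, the noise model, the seed, and the two sample sizes are assumptions made purely for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, rng):
    """x and y have no effect on each other; both depend on a hidden z."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]
    y = [zi + rng.gauss(0.0, 1.0) for zi in z]
    return x, y, z


rng = random.Random(7)  # fixed seed (arbitrary) so the run is repeatable
results = {}
for n in (1_000, 20_000):
    x, y, z = simulate(n, rng)
    r_observed = pearson_r(x, y)
    # Only possible here because the simulation lets us see z:
    # subtract z out of both variables and correlate what is left.
    r_adjusted = pearson_r(
        [xi - zi for xi, zi in zip(x, z)],
        [yi - zi for yi, zi in zip(y, z)],
    )
    results[n] = (r_observed, r_adjusted)
    print(f"n={n}: observed r={r_observed:.2f}, with z removed r={r_adjusted:.2f}")
```

By construction the correlation between x and y should sit near 0.5 at both sample sizes, so collecting twenty times as much data does nothing to make it go away. It only vanishes once z is removed, and in real observational data z is exactly the thing that was never recorded.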

This would all be fine if the goal of sifting through large amounts of data and identifying regularities was solely to develop a better understanding of the world, in the same way that astronomers sift through enormous amounts of data to develop a steadily better understanding of the universe.  There would not be any harm if all we wanted to do was observe empirical regularities, hypothesize about relationships, and then wait to see if the next round of data collection confirmed our hypotheses.  I love doing that with historical data, since if I am wrong, no one is going to die because of some misguided policy that I propose, because everyone I study is already dead.  And of course I love doing that with contemporary data.  I don’t work that much with contemporary data, but others do, and we learn all sorts of remarkable things.

The scarier and probably more likely scenario, however, is that analysts will attempt to translate empirical regularities observed in ‘big data’ into government policy, company strategy, or individual behavior change without deep consideration of the possibility that the observed relationship is spurious, and perhaps can’t even be explained.  At best, this will lead to wasted effort, because the relationship of concern was spurious to begin with, and changing policy or changing behavior will have no effect.  In a worst case scenario, however, it could be destructive.

We already have many examples where policies, or at least recommendations, based solely on observational data had downright pernicious effects.  Hormone replacement therapy comes to mind.  Large observational studies based on what at the time was ‘big data’ led to a conclusion that hormone replacement therapy would reduce the risk of heart disease.  Eventually, better designed studies revealed that hormone replacement therapy didn’t reduce the risk of heart disease, and may have increased it along with the risk of breast cancer.  That is but one example.  The health and public policy literature is littered with other examples of recommendations for diet change or other lifestyle change that were made based on survey studies or other observational studies, but were not borne out in later, more rigorous studies.

I am terrified that as we move forward into an era of ‘big data’, results from the correlations of millions of variables with each other will be reported uncritically, and we will be subjected to an endless stream of breathless reports based on observed but in the end spurious relationships, perhaps that people who eat mangoes on Tuesdays are more likely to be struck by lightning, or people whose last names contain three or more vowels are more likely to buy yellow cars, etc.  If you think that is paranoia, just consider how many studies are already published every week that suggest that some slight diet modification raises or lowers the chances of some obscure cancer, based on observational data.

What is to be done?

I’m all in favor of continuing to collect and analyze data, including ‘big data’.  Every once in a while, a relationship may emerge that really matters.  And in many cases, even empirical regularities are useful and interesting to observe, even if we can’t explain such regularities.  Traffic planners may find it very useful to find out that a certain street is especially likely to be clogged with traffic on days of the month that are also prime numbers, even if they have no idea why.  Companies may find it very useful to know about patterns in customer behavior, even ones they can’t explain.

That said, we need to retain some healthy skepticism about the implications of associations observed in the analysis of ‘big data.’  Basically, we need to accept that ‘big data’ is not a magic bullet that makes more fundamental issues about inference vanish.  Based on the results of efforts by social scientists so far, I’m doubtful that having orders of magnitude more data will suddenly allow us to predict individual behavior with great specificity, or predict dramatic social changes.  Life will probably remain stochastic at both the individual and the aggregate level.  We may develop models that are useful for predicting the frequency of particular types of behavior in a sort of actuarial fashion, where we may predict that on average X percent of people with specified characteristics will do Y over some time period, but I doubt that we will ever have models that predict that individual i who has specified characteristics will do something on a specified date.  In other words, we may have lots of data that may be useful in actuarial calculations about average outcomes for aggregates of people, but I doubt we’ll get to the point where we can reliably predict the behavior of specific individuals in the short term.
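The distinction between actuarial and individual prediction can be illustrated with a toy simulation.  This is only a sketch under strong assumptions: every person does Y independently with the same known probability, and the function name and parameters are hypothetical.

```python
import random


def actuarial_vs_individual(n=10_000, p=0.10, seed=7):
    """Simulate n people who each do Y with probability p.

    Returns (observed_rate, hit_rate):
      observed_rate: share of the population that actually did Y
      hit_rate: share of correct guesses when we name round(n * p)
                specific individuals, chosen at random, as the ones
                who will do Y
    """
    rng = random.Random(seed)
    did_y = [rng.random() < p for _ in range(n)]
    observed_rate = sum(did_y) / n

    # Knowing only p, we have no basis for picking out individuals,
    # so naming specific people amounts to drawing them at random.
    guessed = rng.sample(range(n), round(n * p))
    hit_rate = sum(1 for i in guessed if did_y[i]) / len(guessed)
    return observed_rate, hit_rate


if __name__ == "__main__":
    rate, hits = actuarial_vs_individual()
    print(f"share of population doing Y: {rate:.3f}")
    print(f"share of named individuals who did Y: {hits:.3f}")
```

The aggregate share lands close to p, which is the actuarial prediction working as intended.  The guesses about named individuals are right only about as often as the base rate, because in this setup the model carries no information that separates one person from another.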

The nightmare scenario is that the current bad situation, in which we already have almost weekly news reports based on dubious, never-replicated analyses suggesting that doing X increases our chances of suffering Y, will turn into a worse one in which we have a daily or hourly stream of results claiming that individuals who do X raise their risk of experiencing Y, or that companies or cities, counties, or states that implement policy X will likely experience outcome Y.  Data mining may lead to a spasmodic, panicked, ever-changing set of recommendations to individuals, companies, or governments that eventually produces cynicism, and perhaps a backlash in which nobody believes anything based on empirical observation.

At the very least, this suggests a need for a very high bar for claiming that observed associations are suggestive of causal relationships that in turn lead to policy prescriptions, or recommendations for changes in behavior.  Ideally, associations will need to be observed in multiple, independent datasets, and will need to come with some sort of plausible account of the underlying mechanism or process generating the relationship.  In an ideal world, empirical observations of potentially important relationships would be followed up by more rigorous analyses, like the ones much in vogue among economists, that would try to establish causality, or at least provide some evidence for it.

This isn’t to say that we need to fetishize causality and turn our noses up at any analysis that doesn’t rely on instrumental variables, a natural experiment, or some sort of randomized field experiment.  Rather, the prescriptions for behavior or policy that we develop based on observations from big data have to be calibrated according to the import of the outcome, the plausibility of the proposed underlying mechanism or process, and the cost of the proposed change in behavior or policy.  If analysis of ‘big data’ suggests that people who avoid wearing plaid on Thursdays appear to have a lower risk of being bitten by rabid squirrels, it wouldn’t cost much to avoid wearing plaid on Thursdays for a few months until the result is confirmed.  But if analysis of ‘big data’ suggests that carrying around bricks of depleted uranium substantially reduces our chances of being attacked by seagulls, we might want to hold off doing anything pending some careful thought and further investigation.

Along these lines, it would be a good idea for the engineers and computer scientists who are plunging ahead with the collection and analysis of ‘big data’ to learn from the experience of social scientists who have been grappling with the limitations of observational data for decades.  As Bismarck said, “Fools learn from experience. I prefer to learn from the experience of others.”  Those who are now collecting and analyzing ‘big data’ should learn from the experience of social scientists, rather than reinventing the wheel and repeating the same mistakes social scientists have made for the last few decades.  The most important lesson is perhaps to be humble, and be aware of the limitations of observational data.  Perhaps we should invite computer scientists or engineers working with social data into our research methods classes, not to teach them new statistical techniques, but to teach them the fundamentals of study design, like the difference between experimental and observational designs, the circumstances under which an inference of causality may be justified, and the dangers posed by selection processes and omitted variables.

Conversely, as social scientists, we need to incorporate training in the management of large and complex datasets into the undergraduate and graduate social science curriculum.  Right now, our quantitative training typically provides students with predigested datasets that don’t require any manipulation, and then teaches them a variety of flavors of regression, some very exotic, that they can use to estimate models on those datasets. We almost never offer systematic training to students in how to manipulate those datasets to create new variables.  And we almost never offer any systematic training in how to take ‘found’ data (perhaps the output from a web server log, or administrative data) and suck it into STATA or some other program, and organize it.

As a result, we have students who know how to take a dataset that someone hands them and run a five-stage least squares regression with a cubic spline for age, income instrumented by the level of solar background radiation, and a Heckman selection correction.  But if you hand them a more complex longitudinal dataset like CMGPD that may require some simple manipulation to create variables measuring household or community characteristics to include in a discrete-time event history analysis via a simple logistic regression, they’re stuck.  In the years I spent teaching regression, it was clear to me that for many students, the biggest problem was not in choosing variables, estimating a regression, and interpreting results, but in preparing the data for the estimation.
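As a small illustration of the kind of manipulation meant here, the sketch below attaches a household-level variable (household size in a given register year) to individual person-year records.  It is written in Python with the standard library only.  The records and the field names (person_id, household_id, year, age, hh_size) are invented for the example and are not the actual CMGPD variable names.

```python
from collections import Counter

# Hypothetical person-year records: one row per person per register year.
PERSON_YEARS = [
    {"person_id": 1, "household_id": "A", "year": 1800, "age": 41},
    {"person_id": 2, "household_id": "A", "year": 1800, "age": 38},
    {"person_id": 3, "household_id": "A", "year": 1800, "age": 70},
    {"person_id": 4, "household_id": "B", "year": 1800, "age": 25},
    {"person_id": 1, "household_id": "A", "year": 1803, "age": 44},
    {"person_id": 2, "household_id": "A", "year": 1803, "age": 41},
    {"person_id": 4, "household_id": "B", "year": 1803, "age": 28},
    {"person_id": 5, "household_id": "B", "year": 1803, "age": 1},
]


def add_household_size(rows):
    """Return copies of the rows with hh_size: the number of records
    sharing the same household_id and year."""
    sizes = Counter((r["household_id"], r["year"]) for r in rows)
    return [dict(r, hh_size=sizes[(r["household_id"], r["year"])]) for r in rows]


if __name__ == "__main__":
    for row in add_household_size(PERSON_YEARS):
        print(row)
```

The same pattern (group by household and year, compute a summary, attach it to every member) extends to counts of kin of particular types or to community-level measures.  In Stata the equivalent is typically a short by-group statement, along the lines of bysort household_id year: gen hh_size = _N, again with hypothetical variable names.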

There are already many excellent social scientists who create and work with absolutely ginormous datasets.  I would speculate that when it comes to the techniques for managing those large and complex datasets, most of them are either self-taught, collaborating with computer scientists with expertise in database management, or came in from other fields.  But we can’t rely on graduate students or faculty with relevant skills for manipulating large datasets to keep falling from the sky the way they have in the past.  We have to produce them systematically.

Now to put my tinfoil hat on, another serious concern I have about ‘big data’ is that it may not turn out to be that useful in terms of improving our understanding of processes and mechanisms by which individual context and characteristics affect individual behavior or outcomes, but will likely prove to be a goldmine for post hoc extraction of information about individuals’ past behavior that could be used to embarrass or blackmail them.  In other words, it may turn out that big data leads to little in the way of important, fundamental insights about human behavior, but will facilitate the creation of individual dossiers full of tidbits that can be hauled out to embarrass people whenever they seek political office, blow the whistle on their employer, or who knows what.  Various totalitarian states collected enormous amounts of information on their citizens via surveillance and the reports of informers.  I’m not sure that the data ever allowed any of those states to predict individual behavior or social change.  If the data could have been exploited to make accurate predictions about individuals or society itself, some of those totalitarian states might still be around.  What we learned, however, is that the information was less useful for prediction than for control.

Note: I have been going back and modifying this as I have had more thoughts, or received feedback.  An exchange with Mark Hayward was particularly inspiring because it drew attention to the need for social scientists to develop a response.

Opening old Excel files in STATA 12

I ran into some problems importing old Excel files into STATA 12.  Since I thought others would probably be encountering the same problem, I decided to write a blog post about it.

We’re getting ready to produce a draft release of our China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC) so that users can kick the tires and report problems before we submit a final version to ICPSR for dissemination there.

As part of the preparation, we wanted to take advantage of the new facility in Stata 12 that allows Excel files to be opened directly.  Our ‘raw’ data consist of Excel spreadsheets entered by our coders, one per register.  Registers are annual or triennial.  For our Liaoning dataset, we have 737 registers coded.  For Shuangcheng, we have 338.  Previously, our procedures for automating the import of the registers in Stata were clumsy, and rarely survived upgrades to Stata or Windows.  At one point we were using the odbc command to loop through and read all the registers, but that broke when we moved to computers that were running 64-bit Windows.  Then we wrote a macro to loop through the Excel files and write them to tab-delimited text files, which STATA could read with insheet.

Converting our programs to use import excel was fairly straightforward.  Basically it just meant substituting import excel for insheet.

When we began running the programs, however, STATA was reporting that it could not load files, and came back with an r(603).  I did notice it could open all .xlsx files, but had more trouble with .xls files.  I began to wonder if the problem was with older versions of Excel files.  Perhaps the import capability assumed a recent version of Excel.  I saved some of the files as .xlsx files and sure enough, STATA could read them.

At that point, it became necessary to convert the thousand or so files that were in older versions of Excel to .xlsx files.  Opening them one by one and saving them to .xlsx would be impractical.
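Before a batch conversion like this, it can help to take stock of which files in a folder are actually in a legacy format.  The following is a minimal sketch in Python (standard library only), not part of the procedure described here.  It relies on two well-known file signatures: Excel 97-2003 .xls workbooks are stored in an OLE2 compound file that begins with the bytes D0 CF 11 E0 A1 B1 1A E1, while .xlsx files are ZIP archives that begin with PK\x03\x04.  The function names are made up for the example, and the check says nothing about whether Stata can read a given file.

```python
from pathlib import Path

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Excel 97-2003 binary container
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container used by .xlsx


def classify_header(head: bytes) -> str:
    """Classify a file by its first bytes."""
    if head.startswith(OLE2_MAGIC):
        return "legacy-binary"
    if head.startswith(ZIP_MAGIC):
        return "ooxml"
    return "unknown"


def excel_container(path) -> str:
    """Read the first 8 bytes of a file and classify its container format."""
    with open(path, "rb") as fh:
        return classify_header(fh.read(8))


def inventory(folder):
    """Map each .xls/.xlsx file directly inside `folder` to its container type."""
    result = {}
    for p in sorted(Path(folder).iterdir()):
        if p.is_file() and p.suffix.lower() in (".xls", ".xlsx"):
            result[p.name] = excel_container(p)
    return result


if __name__ == "__main__":
    # Hypothetical usage: list the files in the current directory.
    for name, kind in inventory(".").items():
        print(f"{kind:14s} {name}")
```

Files reported as "unknown" are worth opening by hand, since the extension alone does not say which version of Excel wrote them.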

I poked around on the net, and found that Microsoft had an Office File Converter tool available for download.  Here is an introduction and here is the download.  The tool requires that the Microsoft Office Compatibility Pack be installed.  By modifying the ofc.ini file and adding the name of a folder under [FoldersToConvert], it is possible to direct OFC to attempt to convert all the old .xls files it finds in a specified folder to .xlsx.

fldr=C:UsersCameronDropboxSharedSkydriveCMGPD DataLN

Here is what my [ConversionInfo] section ended up looking like:


I ran ofc and sure enough, it chugged through the files and converted them and placed them in a directory under the original folder that was called Converted.

Now Stata is happily chewing through the converted files.




First publication using the CMGPD-LN public release!

Congratulations to Wang Lei at the Chinese Academy of Social Sciences’ Institute of Labor and Population Economics!  Wang Lei has just published what we believe is the first publication using the public release of the CMGPD-LN that doesn’t have one of us as a co-author.  The paper is a study of bachelorhood in northeast China in the eighteenth and nineteenth centuries, taking advantage of the excellent data on marital status available in the CMGPD-LN.  It appeared in 人口与经济 (Population and Economics), which is one of China’s major social science journals.

We all expect that this will be just the first of many publications by others that make use of the CMGPD-LN.

Here is the full citation for anyone who is interested:

Wang Lei.  2013.  清代辽东旗人社会中的男性失婚问题研究-基于中国多世代人口数据库—辽宁部分( CMGPD-LN) (A Study of Males’ Out-of-marriage in Bannerman Society of East Liaoning in Qing Dynasty: Based on CMGPD-LN).  人口与经济 (Population and Economics).  2013(2):35-43.

And for anyone who is interested, here is a paper we published on male marriage, which Wang Lei was kind enough to cite: