CMGPD-SC now available at ICPSR!

I am pleased to report that the China Multigenerational Panel Dataset-Shuangcheng is now available for download at ICPSR:

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/35292

We would like to thank everyone who worked with the draft versions of the release and documentation and reported problems. If you have been working with a draft version of the release downloaded from my own website, I strongly recommend that you download the official release and begin working with it. It incorporates a number of fixes to address problems reported by users.

We anticipate releasing the Landholding File sometime this fall. This will include landholding records that are linked to individuals recorded in the registers. We will also be releasing updates to the User Guide and other documentation over the next year.

Over the next year, we will also overhaul the variables related to official position to reflect new information located in the registers by Shuang Chen. We will also release a price time series.

Preparation of the CMGPD-SC and accompanying documentation for public release via ICPSR DSDR was supported by the National Institutes of Health, Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) Grant no. R01 HD070985 “Multi-generational Demographic and Landholding Data: CMGPD-SC Public Release.” Contents are solely the responsibility of the authors and do not necessarily represent the official views of the NICHD.

 

CMGPD Training Guide Video: From the Original Registers to the Database

 

I recorded a third video today. This narrates the portion of the Training Guide that discusses the process by which we turned the original registers into the CMGPD-LN database. There is some discussion of the original format and content of the data, and the implications for analysis. In particular, there is discussion of the origins of the variables for entries and exits that are the basis of event-history analysis.

CMGPD Training Guide Video: Strengths and Weaknesses of the CMGPD-LN

I recorded another narration from the CMGPD Training Guide. This one is for the section that discusses the strengths and weaknesses of the CMGPD-LN. The discussion of strengths focuses on features of the CMGPD-LN that make it unique among sources for the study of historical demography. The discussion of weaknesses highlights some areas where caution needs to be exercised when carrying out analysis. Visitors in China may find it more convenient to view the video that I uploaded at Tudou.

Other videos will be available at the YouTube playlist devoted to CMGPD Training Guide videos.

Summer 2014 China Multigenerational Panel Dataset Workshop at SJTU (English announcement)

The 4th China Multigenerational Panel Dataset Workshop
Shanghai Jiaotong University, Minhang Campus
Shanghai, China

July 14-25, 2014

Chinese version (中文版)

The Center for the History and Society of Northeast China at the Shanghai Jiaotong University School of Humanities will hold its 4th summer China Multigenerational Panel Data workshop from July 14 to July 25.

The workshop will focus on introducing the China Multigenerational Panel Datasets (CMGPD) as sources for the study of demography, stratification, and social and family history. These include the China Multigenerational Panel Dataset – Liaoning (CMGPD-LN) and the China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC). The CMGPD have been released via the Inter-university Consortium for Political and Social Research (ICPSR). The latest versions of the CMGPD documentation are available for download.

The CMGPD datasets have many unique features that make them useful not only for the study of Chinese population, social, and family history, but for the study of demographic, social and economic processes more generally.  Their features also make them useful as testbeds for researchers developing novel quantitative techniques.  The datasets are longitudinal, multi-generational, and structured at multiple levels, including the individual, the household, the kin group, the community, the administrative unit, and the region.

UCLA Professor of Sociology Cameron Campbell will be the primary lecturer. Guest lecturers will include Distinguished Professor and Dean of Humanities and Social Sciences at the Hong Kong University of Science and Technology James Lee; Yuxue Ren, Professor of History at Shanghai Jiaotong University; and Dong Hao, PhD student at the Hong Kong University of Science and Technology.

This class is intended to 1) introduce researchers to the CMGPD datasets and help them decide whether they may be useful in their own studies, 2) give current users an opportunity to learn more about the origin and context of the data, and 3) give participants basic instruction in the use of STATA to describe, organize and analyze the data.   Researchers who have already started using the CMGPD-SC or CMGPD-LN are welcome to attend and take advantage of the opportunity to discuss any questions they may have with Lee, Campbell, and others who were involved in the creation of the dataset.

Lectures and discussion will focus on 1) the historical, social, economic and institutional context of the populations covered by the data, 2) key features of the data, and 3) potential applications.  There will be optional sessions to introduce the Training Guide and demonstrate basic procedures for downloading the data from the website and loading it into STATA.

Please note that while there will be basic instruction in the use of STATA to organize and analyze the data, this is not intended as a class in STATA, or introductory statistics. Students looking specifically for instruction in STATA, statistics, or data management are encouraged to look elsewhere. Again, the class is intended for participants who want to assess whether CMGPD is suitable for their research interests, or are already considering the use of the CMGPD and seek basic instruction in the use of STATA to manipulate and analyze it.

The workshop will include daily exercises to introduce key features of the data, and STATA techniques for taking advantage of these features. Participants will also complete a small project of their own design using the data and present it on the last day of the workshop.

If any non-Chinese speakers enroll, the lectures will be in English.  If the participants all speak Chinese, lectures may be in Chinese, or a mixture of English and Chinese.  Discussion will be in English and Chinese.

The Shanghai Jiaotong University Center for the History and Society of Northeast China was established as a research unit by a collaboration of the Shanghai Jiaotong University (SJTU) School of the Humanities and the Hong Kong University of Science and Technology (HKUST) School of the Humanities and Social Sciences.

Datasets

China Multigenerational Panel Dataset – Liaoning (CMGPD-LN)

The CMGPD-LN is an important dataset for the study of China’s family, social and demographic history, and for the study of demography and stratification more generally. The dataset is suitable for application of a wide variety of statistical techniques that are commonly used in social demography for the analysis of longitudinal, individual-level data, and available in the most popular statistical software packages. The dataset is distinguished by its size, temporal depth, and richness of detail on family, household and kinship context.

The materials from which the dataset was constructed are Shengjing Imperial Household Agency household registers held in the Liaoning Provincial Archives. The registers are triennial. Altogether there are 3,600 of them. We transcribed a subset of them to produce the CMGPD-LN, which spans 160 years from 1749 to 1909. At present, the dataset comprises 29 register series, and consists of 1,500,000 records that describe 260,000 individuals over seven generations. The CMGPD-LN is accordingly an important resource for the study of historical demography, sociology, economics, and other fields.

The CMGPD-LN and associated English-language documentation are already available for download at ICPSR.

China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC)

The CMGPD-SC covers communities of recent settlers in Shuangcheng, Heilongjiang in the last half of the nineteenth century and beginning of the twentieth. It contains 1.35 million records that describe 100,000 people. The registers cover descendants of urban migrants from Beijing and rural migrants from neighboring areas in northeast China who came to the area in the first half of the nineteenth century as part of a government-organized effort to settle this largely vacant frontier region. One of the distinguishing features of this dataset is the availability of linked, individual-level landholding records for several points in time. The data also include a rich array of other indicators of household and family context and socioeconomic status.

Pending release of the CMGPD-SC through ICPSR, the data are available for download here.

Information

Dates
Monday, July 14, 2014 to Friday, July 25, 2014

Location
Shanghai Jiaotong University School of Humanities (SJTU Minhang Campus, Shanghai)

Application deadline
May 1, 2014

See link below to download application

Application procedure

Please send your personal statement, curriculum vitae, and application form (English or 中文) as attachments to chinanortheast@gmail.com.

Applications from faculty, postdoctoral researchers and graduate students are welcome. Applications from graduating college seniors will also be considered if they have already been accepted into a graduate program beginning fall 2014.  In that case, the application should include a copy of their graduate school acceptance. Any other interested parties should contact our staff at chinanortheast@gmail.com before applying to see if they will be considered.

Participants should be able to speak or read Chinese or English.  No prior experience in statistics, demography, or Chinese history is required.  Applicants must explain the reasons for their interest in the data in their application, and should demonstrate that they have background, experience or interests that in some way are relevant.

Participants who are Chinese nationals will be provided with accommodations. Participants who are not Chinese nationals will receive assistance with arranging accommodations, and will receive a housing subsidy to help offset their costs. Participants who want other accommodations will have to arrange them on their own and will be responsible for all associated costs.

Participants should bring their own computer.

Students are responsible for all travel and local expenses, health care expenses, and other incidentals. Participants coming from abroad are strongly encouraged to confirm that their health insurance offers international coverage, or purchase travel health insurance.

Participants who are not Chinese nationals will need to obtain visas. We will issue invitation letters to facilitate the visa application. We strongly urge that accepted participants who need visas begin the application process as soon as possible after they are notified of their acceptance.

At present we expect to be able to accommodate 25-30 participants.

Links

Required Reading

Read the following before the workshop begins. The highest priority is the specified pages in the CMGPD-LN and CMGPD-SC User Guides.

Documentation

The documentation below is available here.

  • CMGPD-LN User Guide.  English pages 1-54, 90-96 or Chinese pages 13-64, 96-101.  Skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD-SC User Guide.  English pages 1-47. Again, skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD Training Guide. Pay particular attention to the sections at the beginning that introduce the data and highlight its distinctive characteristics.

Research Articles

  • Campbell, Cameron and James Lee. 2002 (publ. 2006). “State views and local views of population: Linking and comparing genealogies and household registers in Liaoning, 1749-1909.” History and Computing. 14(1+2):9-29.  http://papers.ccpr.ucla.edu/papers/PWP-CCPR-2004-025/PWP-CCPR-2004-025.pdf
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Appendix A.
  • Campbell, Cameron and James Z. Lee. 2011. “Kinship and the Long-Term Persistence of Inequality in Liaoning, China, 1749-2005.” Chinese Sociological Review. 44(1):71-104.  http://www.ncbi.nlm.nih.gov/pubmed/23596557

Review Articles

  • 康文林 (Cameron Campbell).  2012.  “历史人口学 (Historical Demography).”  Chapter 8 in 梁在编 (Zai Liang ed.) 人口学 (Demography).   北京:人民大学出版社 (Beijing: Renmin University Press), 233-265.

Select one or two of the following research articles based on your own interests (or another published article that uses the CMGPD), and read them before the workshop starts.

  • CHEN Shuang, James Lee, and Cameron Campbell. 2010. “Wealth stratification and reproduction in Northeast China, 1866-1907.” History of the Family. 15:386-412.  http://www.ncbi.nlm.nih.gov/pubmed/21127716
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Chapter 10.
  • Wang Feng, Cameron Campbell, and James Z. Lee. 2010. “Agency, Hierarchies, and Reproduction in Northeastern China, 1789 to 1840.” Chapter 11 in Noriko Tsuya, Wang Feng, George Alter, James Z. Lee et al. Prudence and Pressure: Reproduction and Human Agency in Europe and Asia, 1700-1900. MIT Press, 287-316.
  • Chen Shuang, Cameron Campbell, and James Z. Lee.  Forthcoming.  “Categorical Inequality and Gender Difference: Marriage and Remarriage in Northeast China, 1749-1912.”  Chapter 11 in Lundh, Christer, Satomi Kurosu, et al. Similarity in Difference.

Software

If you are not familiar with STATA, prepare for the workshop by reviewing as many of the materials for learning and using STATA at UCLA IDRE as possible. You are also strongly encouraged to watch video tutorials at the STATA website. Ideally, by the time you arrive at the workshop, you should already be able to  carry out very basic operations in STATA such as loading and saving files, creating tabulations and so forth. Do try to download the CMGPD-SC or CMGPD-LN and make sure you know how to load them and carry out very simple operations.

Recommended Reading

  • As much of the User Guides and Training Guide as you can.
  • 定宜庄, 郭松义, 李中清, 康文林. 2004. 辽东移民中的旗人社会.  上海:上海社会科学出版社.
  • Lee, James and Cameron Campbell. 1997. Fate and Fortune in Rural China: Social Organization and Population Behavior in Liaoning, 1774-1873. Cambridge University Press.
  • 李中清,王丰.  2000.  人类的四分之一: 马尔萨斯的神话与中国的现实:1700-2000。  三联·哈佛燕京学术丛书。(English: Lee, James and Wang Feng.  1999.  One Quarter of Humanity: Malthusian Mythology and Chinese Reality, 1700-2000.)
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.

Tentative Schedule (at OneDrive)

Acknowledgements

Preparation of the CMGPD-LN and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD057175-01A1 “Multi-Generation Family and Life History Panel Dataset” with funds from the American Recovery and Reinvestment Act.

Preparation of the CMGPD-SC and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD070985-01 “Multi-generational Demographic and Landholding Data: CMGPD-SC Public Release.”

The CMGPD summer workshops in Shanghai have been supported by Shanghai Jiaotong University, the School of Humanities, the Department of History, and the Center for the History and Society of Northeast China.  We are also grateful to staff at a variety of campus units at SJTU for their logistical support.

More data doesn’t automatically lead to deeper understanding…

Finally, someone has very publicly thrown cold water on the wild claims made for the potential of ‘big data’. I like the title: “Why Big Data is Not Truth.”

It seems like every week now, I hear or read about someone in the news, typically an engineer or a computer scientist but very rarely a social scientist, breathlessly extolling the potential of ‘big data’ to yield transformative insights into social phenomena or individual behavior.  Almost inevitably this is illustrated with an utterly banal example of a finding, usually fit for nothing more than a cocktail party conversation, like perhaps people with small heads (as inferred from the sizes of the hats they buy) consume unusually large numbers of mangoes on Tuesdays and Thursdays.  That is a made-up example, but to me it is representative of the sorts of trivial and atheoretical ‘findings’ that too often are hauled out in puff pieces about the golden world of opportunity offered by big data.  The banality of these ‘findings’ illustrates the fundamental challenge that we face when we seek insight into underlying processes or mechanisms from observational data on people: describing a relationship is not the same as understanding it, or explaining it.

Correlation is not causality, and the problem doesn’t disappear no matter how much data we throw at it.  Whether a dataset contains one thousand records with one hundred variables, or one trillion records with one million variables, if it is observational data collected ‘in the wild’ or via a survey, any association observed in it is still just an empirical finding, albeit a potentially important one, until it is replicated in different settings with different data, and has a credible explanation.  A larger dataset or more variables don’t magically compensate for the fact that the data is based on observation, as opposed to generated by a controlled experiment with random assignment to treatment and control group.  If we’re lucky, there may be something in the data that can be thought of as an exogenous shock experienced by a random subset of the subjects, in which case differences between subjects who experienced the shock and those who didn’t may be interpreted as a genuine effect of the shock.

Lest anyone accuse me of being prejudiced against large datasets with many variables, let me be the first to say that some of my best friends are large datasets.  Indeed, for the last twenty years, I have helped create large historical datasets, analyze them, and release them to the public in the hope that others will be able to find applications for them that I could never imagine.  We have created datasets that record people who lived in China in the eighteenth and nineteenth centuries from birth to death, recording at regular intervals their social and economic status, their household and community context, and their demographic behavior and socioeconomic attainment.  I will probably continue helping to compile and analyze such datasets for the rest of my career, because that is how I roll, and because no one has shown up at my doorstep with a suitcase full of cash that would be mine if only I would join them on some sort of outlandish caper like you would expect in a Ross Thomas novel.

It is this experience with large datasets that has made me wary of the more extravagant claims for big data.  My collaborators and I have learned a great deal about life in the past in China, and about demographic behavior in general, from careful analysis of these data.  I want to continue compiling, analyzing and releasing these and other data.  I am sure that others who work with the data we have publicly released will make even more spectacular and important discoveries, not just about China, but about human populations more generally.

All the effort we have expended in the construction and analysis of these large datasets has made me painfully aware of what it is realistic to hope for.  We can describe important empirical regularities in great detail.  Many of these are of considerable interest in their own right, even if we can only suggest possible explanations for them, because they illuminate life in another time.  They are worth publishing in the same way that some fascinating but inexplicable astronomical phenomenon is worth publishing.

For some findings, an explanation is fairly straightforward and very credible.  We find that married women who had not yet borne children for their husbands, or had borne only daughters, had higher death rates than women who had borne sons.  This makes sense, since in the past in China, the primary responsibility of married women was to bear and then raise an heir for their husband’s family, and until they had at least borne a son, they were probably on a sort of probationary status, with limited access to family resources.  Once they had borne a son, they were probably fully enfranchised members of their husband’s household.  And we find that death rates rose and birth rates fell when grain prices were high, presumably because of economic adversity.

If we’re lucky, we find something that may have some relevance for the contemporary era.  For example, we found that babies born soon after their elder siblings (within 24 months) had elevated death rates in old age.  We speculated that this reflected the effects of maternal depletion on the newborn.  Linked to contemporary results on apparently adverse short-term consequences of a short preceding birth interval, perhaps this might tell us something important about human physiology.

But we also find perplexing results that are robust to alternative specifications and persist no matter what subset of the data we look at, but that we can’t explain.  We find that high status males actually had higher death rates than other males.  We don’t know why, and can only speculate.  Perhaps their status and wealth allowed them to make what our son’s elementary school refers to as ‘bad choices’: maybe they squandered their money on debauchery in Shenyang (at the time, Fengtian) and died early as a result of liver failure or tertiary syphilis.  We just don’t know.

More relevant to my rant, we periodically observe statistically significant associations, some of them quite fascinating, that disappear when we expand the dataset, or use a different subset of the data, or make slight modifications to our model.  If I had a dime for every association like this that we had come across, I’d be a rich man.  I suppose that if the result were interesting, we could come up with some post hoc rationalization of why it only appears in a specific subset of the dataset, when the model is specified in a very particular way, and try to publish it, but that sort of thing makes us queasy, because of our awareness that if you measure enough associations, the phenomenon of mass significance will lead at least some of them to appear to be significant, even if they aren’t.  Again, we feel more comfortable making a claim if a result appears under multiple alternative specifications of the model, and across different subsets of the dataset.

I’m happy to continue plugging away with this sort of analysis indefinitely because I feel like an astronomer, except that instead of peering through a telescope at distant stars or galaxies and then trying to work backwards to develop an explanation for the regularities I observe, I am observing people in the past who I will never meet (unless I can buy a Tardis on Craigslist from a dissipated Time Lord whose alimony, child support, gambling debts and coke habit have made him desperate for money) and trying to discern and provide explanations for the regularities that I observe.  Some of the explanations or interpretations I come up with may be overturned as people uncover even better data or apply better methods, but I am pleased to have made some incremental contribution to our understanding of life in the past.

If the starry-eyed proselytizers of the salvation to be delivered to us by collection and analysis of ‘big data’ were willing to put down their Kool-Aid for a moment and limit themselves to a more cautious prediction that large quantities of data will allow us to observe empirical regularities and every once in a while come up with some genuine insight about the determinants of specific behaviors, I would be happy.  But too often, ‘big data’ proselytizers seem to imagine a future like the one in Isaac Asimov’s Foundation trilogy which I enjoyed so much in middle school, where simply by sifting through enough data, it is possible to predict not only individual behavior, but social change, decades or centuries in advance.  To put it mildly, they’re getting somewhat ahead of the field in their optimism about the possibility of going from observation of individuals to predictions about their behavior.

To me, the biggest challenge to the use of ‘big data’ is some version of the phenomenon of ‘mass significance,’ which I referred to earlier in the context of our own experience.  If you have hundreds or thousands of variables that in reality have nothing to do with each other, and in fact are all series of random numbers generated by die rolls or some other process, but you calculate pairwise correlations between them, inevitably by luck of the draw some percentage of them will appear to have an association that is statistically significant at some threshold.  But if you collect the same data again in another time period, a completely different set of variables may be associated with each other.  In other words, what appears to be a statistically significant association in data collected in one time period may show no association at all in a second time period.  Companies that find that people whose last names end in Y or who like to fill their cars with gas on Wednesdays also tend to be especially receptive to offered discounts on artichokes in one time period, may be disappointed in the next time period when they offer special deals on artichokes to such people.
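The die-roll thought experiment is easy to try. What follows is only a minimal sketch in Python, not an analysis of any real data, and every number in it is an assumption chosen for illustration: 40 unrelated variables, 100 observations each, two 'time periods', and a cutoff of 1.96 divided by the square root of the sample size, which is a rough large-sample approximation to a two-sided 5% significance threshold for a correlation coefficient.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def significant_pairs(variables, threshold):
    """Return the set of index pairs (i, j) whose |r| exceeds the threshold."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(variables)), 2)
        if abs(pearson_r(variables[i], variables[j])) > threshold
    }


def roll_dice(rng, n_vars, n_obs):
    """Generate n_vars unrelated variables, each a series of n_obs die rolls."""
    return [[rng.randint(1, 6) for _ in range(n_obs)] for _ in range(n_vars)]


def simulate(n_vars=40, n_obs=100, seed=0):
    """Collect pure noise in two 'time periods' and flag 'significant' pairs."""
    rng = random.Random(seed)
    # Assumption: rough large-sample two-sided 5% cutoff for |r| under no association.
    threshold = 1.96 / math.sqrt(n_obs)
    period_1 = significant_pairs(roll_dice(rng, n_vars, n_obs), threshold)
    period_2 = significant_pairs(roll_dice(rng, n_vars, n_obs), threshold)
    n_pairs = n_vars * (n_vars - 1) // 2
    return n_pairs, period_1, period_2


if __name__ == "__main__":
    n_pairs, period_1, period_2 = simulate()
    print("pairs tested:", n_pairs)
    print("'significant' in period 1:", len(period_1))
    print("'significant' in period 2:", len(period_2))
    print("'significant' in both periods:", len(period_1 & period_2))
```

With 780 pairs tested, something on the order of 5% of them should clear the cutoff in each period by chance alone, and the pairs flagged in the first period should mostly fail to reappear in the second. Changing the seed changes which pairs are flagged, which is the point.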

Another problem, well known from previous analysis of observational data, is the possibility that observed relationships are not causal, but reflect complex influences of other variables that we don’t observe.  These might be variables that affect the chances of particular types of people being observed in our data, or variables that affect the values of the variables that we do observe.  Whether spurious relationships observed in data are the result of selection biases or the influence of an unobserved variable on the variables that we do observe, any relationship we do observe is unlikely to be causal, and changing behavior or making policy based on it may be premature, to say the least.  And in spite of the claims made for various approaches, I don’t think there is any statistical voodoo that fixes the situation, and allows anyone to make solid claims of causality from purely observational data, except in very limited situations where at least one of the variables appears to be genuinely exogenous, in which case instrumental variables or other approaches may offer some insight.
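A toy simulation makes the unobserved-variable problem concrete. This is a sketch under stated assumptions rather than a model of any real process: the names x, y and z are hypothetical, z is a hidden variable that drives both x and y with a coefficient of 1, and x has no effect on y at all. The 'adjustment' step cheats by using z and its true coefficient, which an analyst working with real observational data would not have.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_confounding(n=5000, seed=1):
    """Return (raw, adjusted) correlations between x and y.

    Assumption: z is a hidden common cause; x and y each equal z plus
    independent noise, so x never influences y.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    raw = pearson_r(x, y)
    # Only possible here because the simulation knows z and its true effect.
    x_net = [xi - zi for xi, zi in zip(x, z)]
    y_net = [yi - zi for yi, zi in zip(y, z)]
    adjusted = pearson_r(x_net, y_net)
    return raw, adjusted


if __name__ == "__main__":
    for n in (500, 5000, 50000):
        raw, adjusted = simulate_confounding(n=n)
        print(f"n={n}: raw r = {raw:.3f}, r after removing z = {adjusted:.3f}")
```

Under these assumptions the raw correlation settles near 0.5 however many records are generated, while the correlation after removing z hovers near zero. More records only make the spurious estimate more precise. They do not make it causal.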

This would all be fine if the goal of sifting through large amounts of data and identifying regularities was solely to develop a better understanding of the world, in the same way that astronomers sift through enormous amounts of data to develop a steadily better understanding of the universe.  There would not be any harm if all we wanted to do was observe empirical regularities, hypothesize about relationships, and then wait to see if the next round of data collection confirmed our hypotheses.  I love doing that with historical data, since if I am wrong, no one is going to die because of some misguided policy that I propose, because everyone I study is already dead.  And of course I love doing that with contemporary data.  I don’t work that much with contemporary data, but others do, and we learn all sorts of remarkable things.

The scarier and probably more likely scenario, however, is that analysts will attempt to translate empirical regularities observed in ‘big data’ into government policy, company strategy, or individual behavior change without deep consideration of the possibility that the observed relationship is spurious, and perhaps can’t even be explained.  At best, this will lead to wasted effort, because the relationship of concern was spurious to begin with, and changing policy or changing behavior will have no effect.  In a worst case scenario, however, it could be destructive.

We already have many examples where policy, or at least recommendations, based solely on observational data had downright pernicious effects.  Hormone replacement therapy comes to mind.  Large observational studies based on what at the time was ‘big data’ led to a conclusion that hormone replacement therapy would reduce the risk of heart disease.  Eventually, better designed studies revealed that hormone replacement therapy didn’t reduce the risk of heart disease, and probably increased the risk of breast cancer.  That is but one example.  The health and public policy literature is littered with other examples of recommendations for diet change or other lifestyle change that were made based on survey studies or other observational studies, but were not borne out in later, more rigorous studies.

I am terrified that as we move forward into an era of ‘big data’, results from the correlations of millions of variables with each other will be reported uncritically, and we will be subjected to an endless stream of breathless reports based on observed but in the end spurious relationships, perhaps that people who eat mangoes on Tuesdays are more likely to be struck by lightning, or people whose last names contain three or more vowels are more likely to buy yellow cars, etc.  If you think that is paranoia, just consider how many studies are already published every week that suggest that some slight diet modification raises or lowers the chances of some obscure cancer, based on observational data.

What is to be done?

I’m all in favor of continuing to collect and analyze data, including ‘big data’.  Every once in a while, a relationship may emerge that really matters.  And in many cases, even empirical regularities are useful and interesting to observe, even if we can’t explain such regularities.  Traffic planners may find it very useful to find out that a certain street is especially likely to be clogged with traffic on days of the month that are also prime numbers, even if they have no idea why.  Companies may find it very useful to know about patterns in customer behavior, even ones they can’t explain.

That said, we need to retain some healthy skepticism about the implications of associations observed in the analysis of ‘big data.’  Basically, we need to accept that ‘big data’ is not a magic bullet that makes more fundamental issues about inference vanish.  Based on the results of efforts by social scientists so far, I’m doubtful that having orders of magnitude more data will suddenly allow us to predict individual behavior with great specificity, or predict dramatic social changes.  Life will probably remain stochastic at both the individual and aggregate levels.  We may develop models that are useful for predicting the frequency of particular types of behavior in a sort of actuarial fashion, where we may predict that on average X percent of people with specified characteristics will do Y over some time period, but I doubt that we will ever have models that predict that individual i who has specified characteristics will do something on a specified date.  In other words, we may have lots of data that may be useful in actuarial calculations about average outcomes for aggregates of people, but I doubt we’ll get to the point where we can reliably predict the behavior of specific individuals in the short term.

The nightmare scenario is that a bad situation in which we already have almost weekly news reports based on dubious, never-replicated analyses suggesting that doing X increases our chances of suffering Y will turn into a worse situation where we have a daily or hourly stream of results claiming that individuals who do X raise their risk of experiencing Y, or that companies or cities, counties, or states that implement policy X will likely experience outcome Y.  Data mining may lead to a spasmodic, panicked, ever changing set of recommendations to individuals, companies, or governments, that eventually produces cynicism, and perhaps a backlash in which nobody believes anything based on empirical observation.

At the very least, this suggests a need for a very high bar for claiming that observed associations are suggestive of causal relationships that in turn lead to policy prescriptions, or recommendations for changes in behavior.  Ideally, associations will need to be observed in multiple, independent datasets, and will need to have some sort of plausible account for the underlying mechanism or process generating the relationship.  In an ideal world, empirical observations of potentially important relationships would be followed up by more rigorous analysis like the ones much in vogue among economists that would try to establish causality, or at least provide some evidence for it.
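A bit of arithmetic shows why observing an association in multiple, independent datasets raises the bar so much.  The sketch below assumes that there is no true relationship, that the datasets really are independent, and that each one is tested at the same five percent level.  These are simplifying assumptions, not a description of any particular study.

```python
def chance_of_false_replication(alpha, n_datasets):
    """Probability that a nonexistent effect passes the cutoff in every dataset."""
    return alpha ** n_datasets


for k in (1, 2, 3):
    print(k, chance_of_false_replication(0.05, k))
```

Under these assumptions, each additional independent dataset multiplies the chance of a purely accidental finding by the significance level.  That is why a relationship seen once deserves much less weight than one seen repeatedly.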

This isn’t to say that we need to fetishize causality and turn our noses up at any analysis that doesn’t rely on instrumental variables, a natural experiment, or some sort of randomized field experiment.  Rather, the prescriptions for behavior or policy that we develop based on observations from big data have to be calibrated according to the import of the outcome, the plausibility of the proposed underlying mechanism or process, and the cost of the proposed change in behavior or policy.  If analysis of ‘big data’ suggests that people who avoid wearing plaid on Thursdays appear to have a lower risk of being bitten by rabid squirrels, it wouldn’t cost much to avoid wearing plaid on Thursdays for a few months until the result is confirmed.  But if analysis of ‘big data’ suggests that carrying around bricks of depleted uranium substantially reduces our chances of being attacked by seagulls, we might want to hold off doing anything pending some careful thought and further investigation.

Along these lines, it would be a good idea for the engineers and computer scientists who are plunging ahead with the collection and analysis of ‘big data’ to learn from the experience of social scientists who have been grappling with the limitations of observational data for decades.  As Bismarck said, “Fools learn from experience. I prefer to learn from the experience of others.”  Those who are now collecting and analyzing ‘big data’ should learn from the experience of social scientists, not by reinventing the wheel and repeating the same mistakes social scientists have made for the last few decades.  The most important lesson is perhaps to be humble, and be aware of the limitations of observational data.  Perhaps we should invite computer scientists or engineers working with social data into our research methods classes, not to teach them new statistical techniques, but to teach them the fundamentals of study design, like the difference between experimental and observational designs, the circumstances under which an inference of causality may be justified, and the dangers posed by selection processes and omitted variables.

Conversely, as social scientists, we need to incorporate training in the management of large and complex datasets into the undergraduate and graduate social science curriculum.  Right now, our quantitative training typically provides students with predigested datasets that don’t require any manipulation, and then teaches them a variety of flavors of regression, some very exotic, that they can use to estimate models on those datasets. We almost never offer systematic training to students in how to manipulate those datasets to create new variables.  And we almost never offer any systematic training in how to take ‘found’ data (perhaps the output from a web server log, or administrative data) and suck it into STATA or some other program, and organize it.

As a result, we have students who know how to take a dataset that someone hands them and run a five stage most squares regression with a cubic spline for age, income instrumented by the level of solar background radiation, and a Heckman sampling correction.  But if you hand them a more complex longitudinal dataset like CMGPD that may require some simple manipulation to create variables measuring household or community characteristics to include in a discrete-time event history analysis via a simple logistic regression, they’re stuck.  In the years I spent teaching regression, it was clear to me that for many students, the biggest problem was not in choosing variables, estimating a regression, and interpreting results, but in preparing the data for the estimation.
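To give a sense of the kind of ‘simple manipulation’ involved, here is a minimal sketch in Python of creating a household-level variable and attaching it to each individual record.  The records and field names (person_id, household_id, year) are hypothetical and chosen only for illustration.  They are not the actual CMGPD variable names, and in practice this step would more likely be done in STATA.

```python
from collections import Counter

# Hypothetical person-year records; the field names are illustrative only.
records = [
    {"person_id": 1, "household_id": "A", "year": 1870},
    {"person_id": 2, "household_id": "A", "year": 1870},
    {"person_id": 3, "household_id": "B", "year": 1870},
    {"person_id": 1, "household_id": "A", "year": 1873},
]


def add_household_size(rows):
    """Attach the number of co-resident records to each person-year record."""
    # Count how many records share each household and year.
    sizes = Counter((r["household_id"], r["year"]) for r in rows)
    # Return new rows so that the original records are left untouched.
    return [
        dict(r, household_size=sizes[(r["household_id"], r["year"])])
        for r in rows
    ]


for row in add_household_size(records):
    print(row)
```

The resulting household_size column could then be included as a covariate in a logistic regression on the person-year records.  The same group, count, and merge-back pattern applies to community-level measures.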

There are already many excellent social scientists who create and work with absolutely ginormous datasets.  I would speculate that when it comes to the techniques for managing those large and complex datasets, most of them are either self-taught, collaborating with computer scientists with expertise in database management, or came in from other fields.  But we can’t rely on graduate students or faculty with relevant skills for manipulating large datasets to keep falling from the sky the way they have in the past.  We have to produce them systematically.

Now to put my tinfoil hat on, another serious concern I have about ‘big data’ is that it may not turn out to be that useful in terms of improving our understanding of processes and mechanisms by which individual context and characteristics affect individual behavior or outcomes, but will likely prove to be a goldmine for post hoc extraction of information about individuals’ past behavior that could be used to embarrass or blackmail them.  In other words, it may turn out that big data leads to little in the way of important, fundamental insights about human behavior, but will facilitate the creation of individual dossiers full of tidbits that can be hauled out to embarrass people whenever they seek political office, blow the whistle on their employer, or who knows what.  Various totalitarian states collected enormous amounts of information on their citizens via surveillance and the reports of informers.  I’m not sure that the data ever allowed any of those states to predict individual behavior or social change.  If the data could have been exploited to make accurate predictions about individuals or society itself, some of those totalitarian states might still be around.  What we learned, however, is that the information was less useful for prediction than for control.

Note: I have been going back and modifying this as I have had more thoughts, or received feedback.  An exchange with Mark Hayward was particularly inspiring because it drew attention to the need for social scientists to develop a response.

Opening old Excel files in STATA 12

I ran into some problems importing old Excel files into STATA 12.  Since I thought others would probably be encountering the same problem, I decided to write a blog post about it.

We’re getting ready to produce a draft release of our China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC) so that users can kick the tires and report problems before we submit a final version to ICPSR for dissemination there.

As part of the preparation, we wanted to take advantage of the new facility in Stata 12 that allows Excel files to be opened directly.  Our ‘raw’ data consist of Excel spreadsheets entered by our coders, one per register.  Registers are annual or triennial.  For our Liaoning dataset, we have 737 registers coded.  For Shuangcheng, we have 338.  Previously, our procedures for automating the import of the registers in Stata were clumsy, and rarely survived upgrades to Stata or Windows.  At one point we were using the odbc command to loop through and read all the registers, but that broke when we moved to computers that were running 64-bit Windows.  Then we wrote a macro to loop through the Excel files and write them to tab-delimited text files, which STATA could read.

Converting our programs to use import excel was fairly straightforward.  Basically it just meant substituting import excel for insheet.

When we began running the programs, however, STATA was reporting that it could not load files, and came back with an r(603).  I did notice it could open all .xlsx files, but had more trouble with .xls files.  I began to wonder if the problem was with older versions of Excel files.  Perhaps the import capability assumed a recent version of Excel.  I saved some of the files as .xlsx files and sure enough, STATA could read them.

At that point, it became necessary to convert the thousand or so files that were in older versions of Excel to .xlsx files.  Opening them one by one and saving them as .xlsx would be impractical.

I poked around on the net, and found that Microsoft had an Office File Converter tool available for download.   Here is an introduction and here is the download.  The tool requires that the Microsoft Office Compatibility Pack be installed.  By modifying the ofc.ini file, and adding the name of a folder under [FoldersToConvert] it is possible to direct OFC to attempt to convert all the old .xls files it finds in a specified folder to .xlsx.

[FoldersToConvert]
fldr=C:\Users\Cameron\Dropbox\Shared\Skydrive\CMGPD Data\LN

Here is what my [ConversionInfo] section ended up looking like:

SourcePathTemplate=*\*\*\*\*\*\*\*\
DestinationPathTemplate=*1\*2\*3\*4\*5\*6\*7\*8\Converted

I ran ofc and sure enough, it chugged through the files and converted them and placed them in a directory under the original folder that was called Converted.
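With a thousand or so files, it is worth confirming that nothing was skipped.  The following is a minimal sketch in Python, not part of the OFC tool.  It assumes that each converted file keeps its original base name with an .xlsx extension and sits in a subfolder called Converted, as in the setup above.  The function name and the folder passed to it are hypothetical.

```python
from pathlib import Path


def missing_conversions(source_dir, converted_name="Converted"):
    """Return names of .xls files with no matching .xlsx in the output folder."""
    source = Path(source_dir)
    converted = source / converted_name
    missing = []
    for xls in sorted(source.glob("*.xls")):
        # Assumption: the converter keeps the base name and changes the extension.
        if not (converted / (xls.stem + ".xlsx")).exists():
            missing.append(xls.name)
    return missing


if __name__ == "__main__":
    # Hypothetical folder; substitute the one listed under [FoldersToConvert].
    print(missing_conversions("."))
```

An empty list means every old file has a converted counterpart.  Any names that are printed can be opened and saved by hand.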

Now Stata is happily chewing through the converted files.

First publication using the CMGPD-LN public release!

Congratulations to Wang Lei at the Chinese Academy of Social Sciences’ Institute of Labor and Population Economics!  Wang Lei has just published what we believe is the first publication using the public release of the CMGPD-LN that doesn’t have one of us as a co-author: http://www.cnki.com.cn/Article/CJFDTotal-RKJJ201302006.htm The paper is a study of bachelorhood in northeast China in the eighteenth and nineteenth centuries, taking advantage of the excellent data on marital status available in the CMGPD-LN. It appeared in 人口与经济 (Population and Economics), which is one of China’s major social science journals.

We all expect that this will be just the first of many publications by others that make use of the CMGPD-LN.

Here is the full citation for anyone who is interested:

Wang Lei.  2013.  清代辽东旗人社会中的男性失婚问题研究-基于中国多世代人口数据库—辽宁部分( CMGPD-LN) (A Study of Males’ Out-of-marriage in Bannerman Society of East Liaoning in Qing Dynasty: Based on CMGPD-LN).  人口与经济 (Population and Economics).  2013(2):35-43.

And for anyone who is interested, here is a paper we published on male marriage, which Wang Lei was kind enough to cite: http://sjeas.skku.edu/upload/200905/17-42JamesLee-1.pdf

 

Summer 2013 China Multigenerational Panel Dataset Workshop at SJTU (English announcement)

Summer 2013 China Multigenerational Panel Dataset Workshop
Shanghai Jiaotong University
Minhang Campus
Shanghai, China

July 15-19, 2013

中文版

The Center for the History and Society of Northeast China at the Shanghai Jiaotong University School of Humanities will hold its third summer China Multigenerational Panel Data workshop from July 15 to July 19.

The workshop will focus on introducing the China Multigenerational Panel Datasets (CMGPD) as sources for the study of demography, stratification, and social and family history. These include the China Multigenerational Panel Dataset – Liaoning (CMGPD-LN) and the China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC).  The CMGPD-LN has already been released via the Inter-university Consortium for Political and Social Research.  Data and documentation are already available for download: http://www.icpsr.umich.edu/icpsrweb/CMGPD/. Chinese language documentation for the CMGPD-LN is available for download here.  Draft documentation for the CMGPD-SC is available for download here.

The CMGPD datasets have many unique features that make them useful not only for the study of Chinese population, social, and family history, but for the study of demographic, social and economic processes more generally.  Their features also make them useful as testbeds for researchers developing novel quantitative techniques.  The datasets are longitudinal, multi-generational, and structured at multiple levels, including the individual, the household, the kin group, the community, the administrative unit, and the region.

UCLA Professor of Sociology Cameron Campbell and Distinguished Professor and Dean of Humanities and Social Sciences at the Hong Kong University of Science and Technology James Lee will be primary lecturers.  Guest lecturers will include Yuxue Ren, Professor of History at Shanghai Jiaotong University; and Dong Hao, PhD student at the Hong Kong University of Science and Technology.

This class is intended to 1) introduce researchers to the CMGPD datasets and help them decide whether they may be useful in their own studies, and 2) give current users an opportunity to learn more about the origin and context of the data.   Researchers who have already started using the CMGPD-SC or CMGPD-LN are welcome to attend and take advantage of the opportunity to discuss any questions they may have with Lee, Campbell, and others who were involved in the creation of the dataset.

Lectures and discussion will focus on 1) the historical, social, economic and institutional context of the populations covered by the data, 2) key features of the data, and 3) potential applications.  Because we have already released a Training Guide that provides instruction on carrying out basic and advanced analysis with the data, this year’s workshop will not provide instruction in STATA, or have computer exercises.  There will be optional sessions to introduce the Training Guide and demonstrate basic procedures for downloading the data from the website and loading it into STATA.

At the end of the week, participants will be asked to make a brief presentation on their ideas for making use of the data.  If participants are already working with the CMGPD, they will be welcome to make brief presentations on their work with it.  There will not be any computer exercises.

If any non-Chinese speakers enroll, the lectures will be in English.  If the participants all speak Chinese, lectures may be in Chinese.  Discussion will be in English and Chinese.

The Shanghai Jiaotong University Center for the History and Society of Northeast China was established as a research unit by a collaboration of the Shanghai Jiaotong University (SJTU) School of the Humanities and the Hong Kong University of Science and Technology (HKUST) School of the Humanities and Social Sciences.

Datasets

China Multigenerational Panel Dataset – Liaoning (CMGPD-LN)

The CMGPD-LN is an important dataset for the study of China’s family, social and demographic history, and for the study of demography and stratification more generally. The dataset is suitable for application of a wide variety of statistical techniques that are commonly used in social demography for the analysis of longitudinal, individual-level data, and available in the most popular statistical software packages. The dataset is distinguished by its size, temporal depth, and richness of detail on family, household and kinship context.

The materials from which the dataset was constructed are Shengjing Imperial Household Agency household registers held in the Liaoning Provincial Archives. The registers are triennial. Altogether there are 3,600 of them. We transcribed a subset of them to produce the CMGPD-LN, which spans 160 years from 1749 to 1909. At present, the dataset comprises 29 register series, and consists of 1,500,000 records that describe 260,000 individuals over seven generations. The CMGPD-LN is accordingly an important resource for the study of historical demography, sociology, economics, and other fields.

The CMGPD-LN and associated English-language documentation are already available for download at ICPSR, following a free registration. Please visit the website: http://www.icpsr.umich.edu/cmgpd

China Multigenerational Panel Dataset – Shuangcheng (CMGPD-SC)

The CMGPD-SC covers communities of recent settlers in Shuangcheng, Heilongjiang in the last half of the nineteenth century and beginning of the twentieth. It contains 1.35 million records that describe 100,000 people. The registers cover descendants of urban migrants from Beijing and rural migrants from neighboring areas in northeast China who came to the area in the first half of the nineteenth century as part of a government organized effort to settle this largely vacant frontier region. One of the distinguishing features of this dataset is the availability of linked, individual-level landholding records for several points in time. The data also include a rich array of other indicators of household and family context and socioeconomic status. We anticipate formal public release of the dataset via ICPSR in 2013 or 2014. We will provide participants in the summer class with access to drafts of the release and documentation.

Information

Dates

Monday, July 15, 2013 to Friday, July 19, 2013

Location

Shanghai Jiaotong University School of Humanities (SJTU Minhang Campus, Shanghai)

Application deadline

May 25, 2013

See link below to download application

Application procedure

Please send your personal statement, curriculum vitae, and application form as attachments to chinanortheast@gmail.com.  We will have an English language application form available soon.

Applications from faculty, postdoctoral researchers and graduate students are welcome. Applications from graduating college seniors will also be considered if they have already been accepted into a graduate program beginning fall 2013.  In that case, the application should include a copy of their graduate school acceptance. Any other interested parties should contact our staff at chinanortheast@gmail.com before applying to see if they will be considered.

Participants should be able to speak or read Chinese or English.  No prior experience in statistics, demography, or Chinese history is required.  Applicants must explain the reasons for their interest in the data in their application, and should demonstrate that they have background, experience or interests that in some way are relevant.

Participants will be offered free housing in graduate student dormitories at SJTU.  Participants who want other accommodations will have to arrange them on their own and will be responsible for all associated costs.  Participants should bring their own computer.  Students are responsible for travel and local expenses.  At present we expect to be able to accommodate 25-30 participants.

Links

Required Reading

Please complete as much of the required reading as possible before the workshop begins.  The highest priority are the assigned readings in the CMGPD-LN and CMGPD-SC User Guides.  Once these are complete

Documentation

  • CMGPD-LN User Guide.  English pages 1-54, 90-96 or Chinese pages 13-64, 96-101.  Skim the descriptions of variables to look for ones that may be relevant to your research.
  • CMGPD-SC User Guide.  English pages 1-47.
  • CMGPD Training Guide.  Please review slides 1-40.  Users who have experience or training in statistics should skim the remainder of the training guide and review the examples of the use of the guide.

Research Articles

  • Campbell, Cameron and James Lee. 2002 (publ. 2006). “State views and local views of population: Linking and comparing genealogies and household registers in Liaoning, 1749-1909.” History and Computing. 14(1+2):9-29.  http://papers.ccpr.ucla.edu/papers/PWP-CCPR-2004-025/PWP-CCPR-2004-025.pdf
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Appendix A.
  • Campbell, Cameron and James Z. Lee. 2011. “Kinship and the Long-Term Persistence of Inequality in Liaoning, China, 1749-2005.” Chinese Sociological Review. 44(1):71-104.  http://www.ncbi.nlm.nih.gov/pubmed/23596557

Review Articles

  • 康文林 (Cameron Campbell).  2012.  “历史人口学 (Historical Demography).”  Chapter 8 in 梁在编 (Zai Liang ed.) 人口学 (Demography).   北京:人民大学出版社 (Beijing: Renmin University Press), 233-265.

Select one or two of the following research articles based on your own interests (or another published article that uses the CMGPD), and read before the workshop starts

  • CHEN Shuang, James Lee, and Cameron Campbell. 2010. “Wealth stratification and reproduction in Northeast China, 1866-1907.” History of the Family. 15:386-412.  http://www.ncbi.nlm.nih.gov/pubmed/21127716
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.  Chapter 10.
  • Wang Feng, Cameron Campbell, and James Z. Lee. 2010. “Agency, Hierarchies, and Reproduction in Northeastern China, 1789 to 1840.” Chapter 11 in Noriko Tsuya, Wang Feng, George Alter, James Z. Lee et al. Prudence and Pressure: Reproduction and Human Agency in Europe and Asia, 1700-1900. MIT Press, 287-316.
  • Chen Shuang, Cameron Campbell, and James Z. Lee.  Forthcoming.  “Categorical Inequality and Gender Difference: Marriage and Remarriage in Northeast China, 1749-1912.”  Chapter 11 in Lundh, Christer, Satomi Kurosu, et al. Similarity in Difference.

Recommended Reading

  • As much of the User Guides and Training Guide as you can.
  • 定宜庄, 郭松义, 李中清, 康文林. 2004. 辽东移民中的旗人社会.  上海:上海社会科学出版社.
  • Lee, James and Cameron Campbell. 1997. Fate and Fortune in Rural China: Social Organization and Population Behavior in Liaoning, 1774-1873. Cambridge University Press.
  • 李中清,王丰.  2000.  人类的四分之一: 马尔萨斯的神话与中国的现实:1700-2000。  三联·哈佛燕京学术丛书。(English: Lee, James and Wang Feng.  1999.  One Quarter of Humanity: Malthusian Mythology and Chinese Reality, 1700-2000.)
  • Bengtsson, Tommy, Cameron Campbell, James Lee, et al. 2004.  Life Under Pressure: Mortality and Living Standards in Europe and Asia, 1700-1900. MIT Press.  Published in Chinese as 托米·本特森,康文林,李中清等. 2008. 压力下的生活:1700~1900年欧洲与亚洲的死亡率和生活水平. 北京: 社会科学文献出版社. Translated by 李霞 and 李恭忠.

Tentative schedule

Acknowledgements

Preparation of the CMGPD-LN and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD057175-01A1 “Multi-Generation Family and Life History Panel Dataset” with funds from the American Recovery and Reinvestment Act.

Preparation of the CMGPD-SC and accompanying documentation for public release via ICPSR DSDR was supported by NICHD R01 HD070985-01 “Multi-generational Demographic and Landholding Data: CMGPD-SC Public Release.”

The CMGPD summer workshops in Shanghai have been supported by Shanghai Jiaotong University, the School of Humanities, the Department of History, and the Center for the History and Society of Northeast China.  We are also grateful to staff at a variety of campus units at SJTU for their logistical support.

 

Recoding variables at IPUMS

For my social demography class at UCLA, I have the students visit the IPUMS website to do basic analysis. I have been using SnagIt to prepare screen-capture videos demonstrating various capabilities at the site. This one introduces recoding variables. You will probably want to watch it full frame in order to make out the text. I intended this for students enrolled in my class, but hope it is useful for anyone who stumbles across it.