Age is one of the most important attributes in one user’s profile. Age detection has many applications like personalized search, targeted advertisement and recommendation. Current research has uncovered the relationship between the use of western language and social identities to some extents. However, the age detection problem for Chinese users is so far unexplored. Due to the cultural and societal difference, some well known features in English may not be applicable to the Chinese users. For example, while the frequency of capitalized letter in English has proved to be a good feature, Chinese users do not have such patterns. Moreover, Chinese has its own characteristics such as rich emoticons, complex syntax and unique lexicon structures. Hence age detection for Chinese users is a new big challenge.
In this paper, we present our age detection study on a corpus of microblogs from 3200 users in Sina Weibo. We construct three types of Chinese language patterns, including stylistic, lexical, and syntactic features, and then investigate their effects on age prediction. We find a number of interesting language patterns: (1) there is a significant topic divergence among Chinese people in various age groups, (2) the young people are open and easy to accept new slangs from the internet or foreign languages, and (3) the young adult people exhibit distinguished syntactic structures from all other people. Our best result reaches an accuracy of 88% when classifying users into four age groups.