Wednesday, May 13, 2009

Creating a corpus of first year university academic writing

Preparing my corpus

This past Friday I presented at the British Columbia Teachers of English as an Additional Language (BC TEAL) conference. The topic was “Comparing non-native and native English undergraduate vocabulary in writing”. The first part of my talk dealt with the creation of the corpus that I used for my analyses.

The focus of my presentation was on what lexical-frequency-based analyses reveal about the breadth of active vocabulary knowledge in the writing of novice native English speaking (NS) and non-native English speaking (NNES) undergraduates. By novice, I mean first year students at the University of Calgary who have not yet passed the Effective Writing Proficiency Requirement.

To investigate this question, I’ve gathered a corpus of writing, which I’m calling the Effective Writing Corpus. The writing samples in the corpus come from the Alberta Universities’ Writing Competence Test, also called the Effective Writing Test (EWT). The EWT is administered to first year students at the University of Calgary, the University of Lethbridge and Athabasca University. It is designed to assess university level writing competence, and it is required of all students who enter university with a score below 75% on the English Language Arts 30-1 Diploma exam, or with a blended grade below 80% (a 50-50 split between the diploma exam and the class score). Students who enter university with higher scores than these are exempt from the EWT. There are also other routes to exemption, such as achieving a grade of B- in a first year English course. In total, the test is sat approximately 2250 times each year, with some of those sittings being repeat attempts by the same students.

The EWT itself takes the form of a persuasive or expository essay answering one of four questions. These questions draw on general knowledge, and no specialized knowledge is needed to answer them. An example of a question on the EWT might be along the lines of “Should the Government of Alberta institute mandatory physical education courses from kindergarten to Grade 12?” The student's essay should be around 400 words, and the markers are looking for university level writing competence. The key points markers pay attention to include logical arguments, clear organization, well-developed paragraphs, well-constructed sentences, accurate word use, and correct grammar, spelling and punctuation. English language dictionaries are permitted in the test, and the students have two and a half hours to complete their essays.

The corpus I am building focuses on the 2003/2004 academic year. Out of the approximately 2250 tests that were written that year, 561 NS students and 184 NNES students gave permission for their tests to be used for research purposes. This is approximately 33% of the total number of tests written that year. Forty different first languages were represented in the raw NNES data. Of these 40 languages, by far the most common were Chinese, Arabic, Spanish, and Punjabi, with Chinese speakers forming the largest group of all NNES students.

Breaking the students down by first language reveals some interesting results in terms of performance on the EWT. Of all NS students who write the EWT, 70% pass on their first attempt. Among NNES students whose first language is not of East Asian origin, 47% pass on their first attempt. Among students whose first language originates in East Asia, only 23% pass on their first attempt. It is also interesting to note that at the end of the academic year, about 700 students still have not completed the Effective Writing Requirement. Of those 700 students, approximately 90% (630) are NNES. If approximately 75% of those NNES students are of East Asian origin, that means about 470 NNES students of East Asian origin are still struggling to complete the Effective Writing Requirement by the end of the school year, and face being blocked from registering in their second year classes.
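The estimate at the end of the paragraph above is simple arithmetic, and it can be checked with a few lines of Python. This is only a sketch: the function name and parameter names are my own, and the inputs are the approximate figures quoted in the post (including the "if" assumption that 75% of the NNES students are of East Asian origin).

```python
def estimate_unfinished(total_unfinished, nnes_percent, east_asian_percent):
    """Return (NNES students, East Asian NNES students) who have not yet
    completed the requirement, given approximate percentages."""
    # Share of unfinished students who are NNES
    nnes = total_unfinished * nnes_percent / 100
    # Share of those NNES students assumed to be of East Asian origin
    east_asian = nnes * east_asian_percent / 100
    return nnes, east_asian


# Figures quoted in the post (all approximate)
nnes_count, east_asian_count = estimate_unfinished(700, 90, 75)
print(nnes_count, east_asian_count)
```

With these inputs the second value comes out a little above 470, which is why the figure in the text is given as "about 470".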

It is because of the struggles that NNES students whose first language is of East Asian origin face in passing the EWT that I have decided to focus on this group for my study. As a result of this focus, 79% of the papers in the NNES sub-corpus are written by students with Chinese (Cantonese and Mandarin) as their first language. The rest of the NNES sub-corpus is made up of papers by students whose first language is Korean, Vietnamese, Japanese or Laotian. The NNES students have varying lengths of residence in Canada, which I have grouped into five cohorts: 14+ years, 10-13 years, 7-9 years, 4-6 years, and less than 3 years. Each of these cohorts contains between 11 and 20 students.

The two sub-corpora (NS and NNES) also revealed some differences in faculty enrolment and topic choice. The top three faculties of enrolment for NS students at the time of writing were Communication and Culture, Science, and Social Science. The top three faculties of enrolment for NNES students at the time of writing were Social Science, Science and Engineering. The top three topics for NS students were Physical Education, Computers, and Urban Growth. The top three topics for NNES students were Physical Education, Computers, and Being Ready for the Workforce.

Before I could begin my analysis, I had to prepare the raw data. The EWT is a handwritten test, written in official University of Calgary exam booklets. All the tests were typed and converted into text files for computer storage and analysis. As the papers were typed, they were corrected for spelling, with the spelling errors noted on the original raw data. Proper nouns, such as the names of people and places, were recategorized into the first one thousand most frequent words of English. Semantic and derivational errors were also recategorized into the first one thousand most frequent words of English. Doing this prepared the data for linguistic analysis using various tools found on the Compleat Lexical Tutor website (Cobb, 2009).
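To make the recategorization step more concrete, here is a minimal Python sketch of the idea of counting proper nouns as part of the first one thousand words when profiling a text. It is an illustration only, not the method used by the tools on the Compleat Lexical Tutor website: the word list `K1_WORDS` is a tiny stand-in for a real list of the first thousand words, `PROPER_NOUNS` is a hypothetical hand-tagged set, and the function names are my own.

```python
import re
from collections import Counter

# Tiny stand-in for a real list of the 1000 most frequent English words
K1_WORDS = {"the", "a", "in", "is", "and", "she", "lives", "city", "big"}

# Hypothetical set of proper nouns tagged by hand during typing
PROPER_NOUNS = {"maria", "calgary"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def band_of(token):
    """Assign a token to a frequency band, counting proper nouns as K1."""
    if token in K1_WORDS or token in PROPER_NOUNS:
        return "K1"
    return "off-K1"


def profile(text):
    """Count how many tokens fall into each frequency band."""
    return Counter(band_of(token) for token in tokenize(text))


sample = "Maria lives in Calgary, a cosmopolitan city."
print(profile(sample))
```

In this sample, "Maria" and "Calgary" are counted with the first thousand words rather than as rare vocabulary, so only "cosmopolitan" falls outside the K1 band. That is the point of the recategorization: names should not inflate a writer's apparent use of low-frequency words.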
