Pacific University
Office Hours

T 1:00-2:00pm
W 10:00-12:00pm
Th 1:00-2:00pm
or by appointment

Contact Info
(503) 352-2008
Strain 203C

2043 College Way
Forest Grove
OR 97116

Shereen Khoja

My primary area of research is Arabic Computational Linguistics. Specifically:

Stemming: Details about the stemmer I have developed for Arabic.
Tagging: Details about the Part-Of-Speech (POS) tagger I am developing for Arabic.
Corpora: Details about the Arabic corpora I am using. I have manually tagged 50,000 words of Arabic newspaper text with the basic tags (noun, verb, particle). I have also tagged 1,700 words with more detailed tags (e.g. singular, masculine, definite common noun). These are available for research purposes. Please e-mail me if you would like a copy of them.
Publications: I have included a couple of my publications here that can be viewed or downloaded. 


Arabic stemmer

I've developed a stemmer for Arabic that is fast and highly accurate. Before I go into any details of the stemmer, let me first explain some of the basics.

What is stemming?

Stemming is the process of removing any affixes from words, and reducing these words to their roots. For example, stemming the English word computing produces the root comput. This is the same root produced by the word computation.
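The idea can be sketched in a few lines of Python. The suffix list below is a toy one, chosen only to reproduce the computing/computation example; it is not the rule set of any real stemmer.

```python
# Toy suffix list for illustration only; a real English stemmer
# (e.g. Porter's) applies ordered rewrite rules instead.
SUFFIXES = ["ation", "ing", "ed", "s"]

def strip_suffix(word):
    """Remove the longest matching suffix, keeping at least three letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Both grammatical variants reduce to the same root:
# strip_suffix("computing") and strip_suffix("computation") give "comput"
```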

What is stemming useful for?

After reducing words to their roots, these roots can be used in compression, spell checking, text searching, and text analysis.

Compression: To reduce the size of documents, large words could be stored in their root form. A small program would then be used to return the document to its original form when opened. It would do this by using context and grammar to determine the original form of the word.

Spell checking: Instead of searching for a complete word in a dictionary, only the root would be searched for. This reduces the size of the dictionary.

Text searching: The best example of this is web search engines. Searching for the root of a word gives a wider search than trying to find an exact match.
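As a sketch of this, assuming a toy stemmer and an invented three-document collection, an index keyed on stems lets one query match all grammatical variants of a word:

```python
from collections import defaultdict

def stem(word):
    # Crude illustrative stemmer: strip one of a few common English suffixes.
    for suf in ("ation", "ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

docs = {1: "computing machines", 2: "a theory of computation", 3: "coffee"}
index = build_index(docs)
# Searching on the stem of "computes" matches both variant forms:
hits = index[stem("computes")]  # documents 1 and 2
```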

Text analysis: In statistical text analysis, for example, stemming helps in mapping grammatical variations of a word to instances of the same term.

How is the stemmer implemented?

The stemmer first removes the longest suffix and the longest prefix. It then matches the remaining word against the verbal and noun patterns to extract the root. The stemmer has been developed in both C++ and Java. As with all natural languages, there are variations to the general rules of the language; the stemmer must deal with these and still produce the correct root.
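The two-stage approach can be sketched in Python. Everything below is illustrative: the affix lists and patterns are small Latin-transliteration stand-ins, not the stemmer's real Arabic data, with 'R' marking a root-letter slot in each pattern.

```python
# Illustrative stand-ins for the stemmer's affix and pattern lists.
PREFIXES = ["al", "wa"]          # e.g. definite article, conjunction
SUFFIXES = ["at", "un"]          # e.g. feminine / nominative endings
# 'R' marks a root-letter slot; any other letter must match exactly.
PATTERNS = ["RaRiR", "maRRaR", "RRR"]

def strip_affixes(word):
    """Remove at most one prefix and one suffix, longest first."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

def extract_root(word):
    """Strip affixes, then collect the letters in each pattern's 'R' slots."""
    word = strip_affixes(word)
    for pattern in PATTERNS:
        if len(pattern) != len(word):
            continue
        root = ""
        for p_ch, w_ch in zip(pattern, word):
            if p_ch == "R":
                root += w_ch
            elif p_ch != w_ch:
                root = None
                break
        if root is not None:
            return root
    return word  # no pattern matched; return the stripped word unchanged

# e.g. "alkatib" (al- + katib, pattern RaRiR) and "maktab" (pattern maRRaR)
# both yield the root "ktb"
```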

What problems might the stemmer face?

  • If the root contains a weak letter (i.e. alif, waw or yah), the form of this letter may change during derivation. To deal with this, the stemmer checks whether the weak letter is in the correct form. If it is not, the stemmer produces the correct form of the weak letter, which in turn gives the correct form of the root.
  • Some words do not have roots, for example the Arabic equivalents of we, after, under and so on. If the stemmer comes across any of these words, it leaves them unchanged.
  • Sometimes a root letter is deleted during derivation. This is especially true of roots that have duplicate letters (i.e. the last two letters are the same). The stemmer can detect this and restore the letter that was removed.
  • If a root contains a hamza, this hamza could change form during derivation. The stemmer detects this and restores the original form of the hamza.
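Two of these special cases, rootless words and hamza handling, can be sketched as a preprocessing step. The stopword set holds the three example words from the text, and the mapping collapses a seated hamza to its bare form; both are illustrative simplifications, not the stemmer's actual data or full logic.

```python
# Rootless words (no root to extract): nahnu "we", ba'da "after",
# tahta "under". Illustrative stopword set, not the stemmer's real list.
ROOTLESS = {"\u0646\u062d\u0646",
            "\u0628\u0639\u062f",
            "\u062a\u062d\u062a"}

# Hamza seated on alif, waw or yaa, mapped to the bare hamza (U+0621).
# A simplified normalization; the real stemmer restores the original form.
HAMZA_FORMS = {"\u0623": "\u0621", "\u0625": "\u0621",
               "\u0624": "\u0621", "\u0626": "\u0621"}

def preprocess(word):
    if word in ROOTLESS:
        return word            # no root to extract; leave the word as-is
    return "".join(HAMZA_FORMS.get(ch, ch) for ch in word)
```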

The Arabic stemmer has been used as part of an Information Retrieval system developed at the University of Massachusetts for the TREC-10 cross-lingual track in 2001. The authors report that although the stemmer produced many mistakes, it improved the performance of their system immensely. The title of their paper is "Arabic Information Retrieval at UMass in TREC-10" by L. S. Larkey and M. E. Connell, University of Massachusetts.


You can download a Java version of the stemmer here.


Grammatical Tagging

What is tagging?

Grammatical tagging (or part-of-speech tagging) is the process of assigning grammatical part-of-speech tags to words based on their context. This process has been automated for English, many other Western languages, and some Asian languages, with accuracy rates ranging between 95% and 98%. As far as I know, a POS tagger has yet to be developed for Arabic, which is why I am developing one myself. Please see my publications for more information on my Arabic POS tagger.

I will now give a brief summary of what POS tagging is useful for. This will be followed by a list of POS taggers that I have found on the web.

What is tagging useful for?

A tagged corpus is more useful than an untagged corpus because it carries more information than the raw text alone. Once a corpus is tagged, information can be extracted from it and used to create dictionaries and grammars of a language from real language data. Tagged corpora are also useful for detailed quantitative analysis of text.
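As an instance of such quantitative analysis, assuming the common word/TAG token format (the sample sentence and tag names here are invented, not the tagset of my corpora), tag frequencies fall out of a tagged corpus in a few lines:

```python
from collections import Counter

# Invented sample in word/TAG format, purely for illustration.
tagged = "the/DET cat/NOUN sat/VERB on/PART the/DET mat/NOUN"

# Split each token into (word, tag); rsplit guards words containing "/".
pairs = [token.rsplit("/", 1) for token in tagged.split()]
tag_counts = Counter(tag for _, tag in pairs)
# tag_counts now maps each tag to its frequency, e.g. NOUN appears twice
```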

List of part-of-speech taggers

Many POS taggers have been developed for English and other languages. Many techniques have been used to develop these taggers, and some taggers are language independent. I have compiled a list of some of the taggers that are available on the web.


Corpora
Although corpora are widely available for English (some for free), very little is available for Arabic. Also, although some of these corpora are marked up with XML or SGML tags, none of them are POS tagged.

I have manually tagged Arabic newspaper text that I can provide freely for research purposes. I have two corpora.

  • 50,000 words of tagged newspaper text. The words are tagged as being either a definite or indefinite noun, a verb, a particle, punctuation or a number.

  • 1,700 words of tagged newspaper text. The words are tagged with more detailed tags that include gender, number and so on. Details of the tagset used can be found in this paper.

If you would like copies of these corpora, please e-mail me and let me know which encoding you would prefer. I can provide the corpora in Unicode, CP1256 (Arabic Windows) or ISO 8859-6.
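For anyone converting between these encodings, Python ships codecs for all three (named "utf-8", "cp1256" and "iso8859-6"); this sketch round-trips an Arabic word through the two legacy encodings:

```python
# The root "ktb" (kaf, ta, ba) in Arabic script.
text = "\u0643\u062a\u0628"

cp1256_bytes = text.encode("cp1256")       # Arabic Windows
iso_bytes = text.encode("iso8859-6")       # ISO 8859-6
round_trip = cp1256_bytes.decode("cp1256") # back to Unicode, unchanged
```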


Publications
An Arabic Tagset for the Morphosyntactic Tagging of Arabic by Shereen Khoja, Roger Garside and Gerry Knowles. Paper presented at Corpus Linguistics 2001, Lancaster University, Lancaster, UK, March 2001, and to appear in a book entitled "A Rainbow of Corpora: Corpus Linguistics and the Languages of the World", edited by Andrew Wilson, Paul Rayson, and Tony McEnery; Lincom-Europa, Munich.

APT: Arabic Part-of-speech Tagger by Shereen Khoja. Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL2001), Carnegie Mellon University, Pittsburgh, Pennsylvania. June 2001.

Automatic Tagging of an Arabic Corpus Using APT by Shereen Khoja and Roger Garside. Paper presented at the Arabic Linguistic Symposium (ALS), University of Utah, Salt Lake City, Utah. March 2001.
