Abstract

Mixing languages in writing and in speech is a major feature of non-English languages in developing countries. This mixed grammar is also emerging in SMS, Facebook communication, and web searching, and any future attempts may further increase the footprint of such a mixed-language knowledge base. Traditional information retrieval (IR) and cross-language information retrieval (CLIR) systems do not exploit this natural human tendency, because their underlying assumption is that a user query is always monolingual. Accordingly, the majority of text collections are either monolingual or multilingual. This paper explores the trends of mixed-language querying and writing. It also shows how the corpus is validated statistically and how an Arabic lexicon can be extracted using co-occurrence statistics. Results show that the distribution of word frequencies in the corpus is highly skewed and that the vocabulary growth is a good fit. Results on handling mixed-language queries are also summarised.

