Cross-Language Information Retrieval (CLIR) involves queries in one language and documents in another, so crossing the language barrier is an integral part of the task. As we discussed in class, this is normally accomplished using one of three main approaches.
Patapsco is a Python framework for CLIR experiments. You can find the GitHub repo for the project here, and this is the paper describing it. Patapsco is a convenient tool for running CLIR pipelines because you specify the parameters in a YAML file, which makes it easy to swap out different portions of the pipeline and run controlled experiments.
For this homework assignment, we will be working from the demo Colab notebook. You are free to clone the notebook and run it in another environment; however, running it in your browser should be sufficient. Run everything up until the PSQ portion of the notebook. This should create a directory runs/query-translation/. If you rerun portions, you will need to change the run name or delete the previous run, since Patapsco will not allow you to overwrite a run. You should find a scores.txt file after each run. As this is a very small demo collection, we will not look at recall; instead, report MAP and NDCG Prime.
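A scores.txt file typically follows trec_eval's whitespace-separated output format (measure, query id, value), with aggregate rows using the query id "all". Here is a small sketch for pulling out the two measures we report; the exact measure names (`map`, `ndcg_prime`) are assumptions and may differ in your output:

```python
# Sketch: extract aggregate MAP and NDCG' values from a trec_eval-style
# scores.txt. Assumes each line is "measure  query_id  value" and that
# aggregate rows use the query id "all"; check your file's measure names.

def read_scores(path, measures=("map", "ndcg_prime")):
    results = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip malformed or blank lines
            measure, qid, value = parts
            if qid == "all" and measure in measures:
                results[measure] = float(value)
    return results
```

This is handy when comparing several runs side by side, since each run directory contains its own scores.txt.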
The first run uses human-translated queries. We will now look at how different preprocessing steps affect the translation. Note that, as we discussed in class, modern neural methods for translation frequently break. Using human translations is normally not possible at inference time, but it is useful for evaluating methods.
In the baseline run, both the documents and the queries are lowercased. First, we will look at the impact of not processing queries and documents in the same way: change the queries to use truecase instead of lowercase. The run should fail. Mismatched preprocessing is such a common problem in IR that automatic checks like this are necessary. Now change the document preprocessing to use truecase as well, and report MAP and NDCG Prime.
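To see why mismatched casing is so damaging, consider a toy inverted index (this is an illustration, not Patapsco's implementation) where documents are indexed lowercased but query tokens are left truecased:

```python
# Toy illustration of why mismatched casing breaks retrieval:
# documents are lowercased at index time, queries are not.

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():   # documents ARE lowercased
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    # query tokens are NOT lowercased, mimicking a truecased query
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return hits

index = build_index({"d1": "Beijing hosted the Olympics"})
print(search(index, "Beijing"))  # -> set(): "Beijing" never matches indexed "beijing"
print(search(index, "beijing"))  # -> {'d1'}
```

Exact term matching means the truecased query term and the lowercased index term are simply different strings, so every capitalized query word silently returns nothing.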
Change the tokenizer to Stanza, and rerun the system using Stanza with truecasing.
Probabilistic Structured Queries (PSQ) (Darwish and Oard, 2003) rely on statistical machine translation to build translation tables for query terms. The Colab notebook already includes a pretrained table called zho_eng_clean_reduced_pdt.dict.
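At its core, a PSQ translation table is a conditional distribution P(target word | source word) estimated from word-aligned bitext. A minimal count-based sketch of that estimation (real tables come from an alignment tool such as GIZA++ and are then pruned and normalized):

```python
from collections import Counter, defaultdict

def translation_table(aligned_pairs):
    """Estimate P(eng | zho) by maximum likelihood from (zho, eng)
    aligned word pairs: count co-occurrences, then normalize per source word."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    table = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        table[src] = {tgt: c / total for tgt, c in tgt_counts.items()}
    return table

pairs = [("猫", "cat"), ("猫", "cat"), ("猫", "cat"), ("猫", "kitten"), ("狗", "dog")]
table = translation_table(pairs)
# table["猫"] -> {"cat": 0.75, "kitten": 0.25}
```

PSQ then scores documents by treating each source query term as a weighted disjunction over its translations, with these probabilities as the weights.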
We are interested in the impact of translation quality on PSQ. Please train another translation table by following the example in the GitHub documentation: Running PSQ Using Patapsco.
This process can be slow (it relies on GIZA++), so feel free to use a smaller bitext (say, 20k lines), even though performance will not match the baseline experiment. You are free to use any publicly available Chinese–English dataset you would like. After creating the new translation table for PSQ, rerun the system with it.
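Tables trained on small bitexts tend to be noisy, with long tails of low-probability junk translations. One common cleanup step is to keep only the top-k translations per source word and renormalize; a sketch (the cutoff values here are illustrative, not prescribed by the assignment):

```python
def prune_table(table, k=2, min_prob=0.05):
    """Keep the k most probable translations per source word, drop entries
    below min_prob, and renormalize so each row sums to 1 again."""
    pruned = {}
    for src, dist in table.items():
        top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
        top = [(t, p) for t, p in top if p >= min_prob]
        total = sum(p for _, p in top)
        if total > 0:
            pruned[src] = {t: p / total for t, p in top}
    return pruned

noisy = {"书": {"book": 0.6, "books": 0.3, "the": 0.08, "of": 0.02}}
cleaned = prune_table(noisy, k=2)  # keeps "book" and "books", renormalized
```

Comparing a pruned versus unpruned table in your rerun is a cheap way to separate alignment quality from table noise.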
There are many ways this pipeline could be improved: better tokenization schemes, better translations, better segmentation, etc. Propose and implement two such methods.
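For instance, one low-cost alternative to word segmentation for Chinese is indexing overlapping character bigrams, which often performs competitively for CLIR. A sketch of the idea (this is one illustrative option, not a required method):

```python
def char_bigrams(text):
    """Tokenize a Chinese string into overlapping character bigrams.
    Whitespace is ignored; strings shorter than two characters are
    returned as single-character tokens."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(char_bigrams("北京大学"))  # -> ['北京', '京大', '大学']
```

Because bigrams need no segmentation model, they sidestep segmentation errors entirely, at the cost of a larger vocabulary and some spurious matches across word boundaries.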
Please write about 500 words on the assignment. Include discussion of how and why the method fails, systematic problems with query translation, additional fixes that might help, and so on. Also discuss how you expect this approach to behave on at least two other languages: how will language family, morphology, script, etc. affect the results?