How would you go about finding duplicate questions on Stack Overflow? More generally, given access to a data feed of the Stack Overflow material (titles, questions, answers, comments), how would you design a program to find the 'most related' questions? "Most related" can be measured either as 'most likely to be a duplicate' or as 'most likely to be on the same topic'. These are correlated, but not identical.
Further, what programming language would you choose for this task? In particular, if your choice of language does not essentially boil down to 'because I know there are libraries to do most of this', then why that language?
Finally, do you think your solution would be fast enough to be used as a filter on Stack Overflow questions, to help reduce the number of duplicate questions?