3

How would you go about finding duplicate questions on Stack Overflow? More generally, given access to a data feed of the Stack Overflow material (titles, questions, answers, comments), how would you design a program to find the 'most related' questions? "Most related" can be measured either as 'most likely to be a duplicate' or as 'most likely to be on the same topic'. These are correlated, but not identical.

Further, what programming language would you choose for this task? In particular, if your choice of language does not essentially boil down to 'because I know there are libraries to do most of this', then why that language?

Finally, do you think your solution would be fast enough to be used as a filter on Stack Overflow questions, to help reduce the number of duplicate questions?

7
  • SO tries to do that already. When you type in your question title, it tries to find questions with similar titles. How well it works, well... that's another matter.
    – FrustratedWithFormsDesigner
    Commented Jun 18, 2010 at 20:39
  • 5
    Hmmm.. this probably should have stayed on SO.
    – Jon Seigel
    Commented Jun 18, 2010 at 20:45
  • 1
    @jon really? it's so broad and speculative as to be NARQ from my perspective.
    – Jeff Atwood StaffMod
    Commented Jun 18, 2010 at 21:58
  • @Jeff: Possibly, but I'm pretty sure it doesn't belong on Meta.
    – Jon Seigel
    Commented Jun 18, 2010 at 22:05
  • I was really rather careful in how I phrased my request so that it belonged on SO rather than on meta. I did not ask for a 'better' way to find duplicates or complain about duplicates. I asked for a design for how to achieve this - a programming question. Commented Jun 19, 2010 at 1:17
  • If so many people are concerned about 'duplicate' questions, and so many people get involved in 'identifying' duplicate questions, then why not build a mechanism to allow users to link/amalgamate questions? Enough votes and the question gets amalgamated.
    – Chris
    Commented Jun 19, 2010 at 6:38
  • @Chris: good idea. Commented Jun 19, 2010 at 14:05

2 Answers

5

Why not download the data dump and give it a go yourself!

1
  • 2
    Because this is really really far from my area(s) of expertise. Code generators I can write. Symbolic math software, that too. Interpreters, tick. Data mining with an objective function? Natch. Commented Jun 19, 2010 at 1:20
0

Use locality-sensitive hashing (LSH): https://www.mit.edu/~andoni/LSH/.
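To make the LSH suggestion concrete, here is a minimal sketch of one common variant: MinHash signatures over word shingles, with banding to bucket likely near-duplicates. It is not the code from the linked page, and every name in it (`shingles`, `minhash_signature`, `candidate_pairs`, `jaccard`) as well as the parameters (3-word shingles, 64 hashes, 32 bands) are assumptions chosen for illustration, not tuned values.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or nothing if there are no words).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each seeded hash, keep the minimum over all tokens.

    An empty token set maps to an all-zero signature (an arbitrary choice here).
    """
    return tuple(
        min((_hash(seed, t) for t in tokens), default=0)
        for seed in range(num_hashes)
    )


def candidate_pairs(signatures, bands=32):
    """Bucket signatures band by band; ids sharing any bucket become candidates.

    `signatures` maps a question id to its MinHash signature.
    """
    buckets = defaultdict(set)
    for qid, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(qid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidates after bucketing."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In a filter, a new question's signature would be looked up in the existing buckets, and only the handful of candidates returned would be checked with the exact `jaccard` score. That keeps the per-question cost roughly independent of the total number of questions, which is the property the speed question above depends on.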
