Finding duplicates on Stack Overflow, programmatically

Question

How would you go about finding duplicate questions on Stack Overflow? More generally, given access to a data feed of the Stack Overflow material (titles, questions, answers, comments), how would you design a program that finds the 'most related' questions? 'Most related' can be measured either as 'most likely to be a duplicate' or as 'most likely to be on the same topic'. These are correlated, but not identical.

Further, what programming language would you choose for this task? In particular, if your choice of language does not essentially boil down to 'because I know there are libraries to do most of this', then why that language?

Finally, do you think your solution would be fast enough to be used as a filter on Stack Overflow questions, to help reduce the number of duplicate questions?

SO tries to do that already. When you type in your question title, it tries to find questions with similar titles. How well it works, well... that's another matter. — FrustratedWithFormsDesigner, Commented Jun 18, 2010 at 20:39
@jon really? it's so broad and speculative as to be NARQ from my perspective. — Jeff Atwood, Commented Jun 18, 2010 at 21:58
@Jeff: Possibly, but I'm pretty sure it doesn't belong on Meta. — Jon Seigel, Commented Jun 18, 2010 at 22:05
I was really rather careful in how I phrased my request so that it belonged on SO rather than on meta. I did not ask for a 'better' way to find duplicates or complain about duplicates. I asked for a design for how to achieve this - a programming question. — Jacques Carette, Commented Jun 19, 2010 at 1:17
If so many people are concerned about 'duplicate' questions, and so many people get involved in 'identifying' duplicate questions, then why not build a mechanism that lets users link or amalgamate questions? With enough votes, the question gets amalgamated. — Chris, Commented Jun 19, 2010 at 6:38

Community · Accepted Answer · 2021-01-18 11:53:42Z

5

Why not download the data dump and give it a go yourself!

edited Jan 18, 2021 at 11:53

answered Jun 18, 2010 at 21:57

Jeff Atwood

2

Because this is really, really far from my area(s) of expertise. Code generators I can write. Symbolic math software, that too. Interpreters, tick. Data mining with an objective function? Natch.
– Jacques Carette
Commented Jun 19, 2010 at 1:20

Rosinante · Accepted Answer · 2012-02-11 19:41:28Z

0

Use locality-sensitive hashing (LSH): https://www.mit.edu/~andoni/LSH/.
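
To make the locality-sensitive hashing idea concrete, here is a minimal sketch of one common variant, MinHash with banding, applied to question titles. This is not the code from the linked page (which, as far as I know, targets nearest-neighbour search in Euclidean space); it is a different member of the same family, chosen because it works directly on sets of words. All names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameters (32 hashes, 16 bands, 2-word shingles) are assumptions picked for illustration, not tuned values.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # signature length (illustrative, not tuned)
BANDS = 16       # number of bands; each band covers NUM_HASHES // BANDS rows


def shingles(text, k=2):
    """Return the set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Stable 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def candidate_pairs(docs):
    """Return pairs of doc ids whose signatures agree on at least one band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text))
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify the candidates."""
    return len(a & b) / len(a | b)
```

In use, `candidate_pairs` would be run over a mapping of question id to title (or title plus body), and each returned pair would then be checked with `jaccard` on the shingle sets before being flagged. The point of the banding step is that only questions landing in a shared bucket are ever compared, which is what could make this usable as a filter at posting time; whether it is fast or accurate enough on the real data dump would have to be measured.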

answered Feb 11, 2012 at 19:41

Rosinante
