October 20, 2008
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
I'm not a huge fan of The Daily WTF for reasons I've previously outlined. There is, however, the occasional gem -- such as this one posted by ezrec:
Browsing through a web archive of some old computer club conversations, I ran across this sentence: "Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"
Hmm. "clbuttic".
Google "clbuttic" - thousands of hits!
There's someone who calls his car 'clbuttic'.
There are "Clbuttic Steam Engine" message boards.
Webster's dictionary - no help.
Hmm. What can this be?
As programmers, this isn't much of a mystery to us; it seems every day a brand new software developer is born and immediately begins repeating all the same mistakes we made years ago. I can't resist linking to Language Log again on this topic, where a commenter disputes whether or not this is an actual real world problem:
The "clbuttics" story may be a little exaggerated if not actually a web-legend. Sure, Google returns 4,000 hits -- but by the time one reaches page 2 (in search of a page that isn't reporting on the silliness, or reporting on the reports, etc.) we're down to 200 hits. Almost all of those 200 seem to have a "clbuttic mistake" by Apple at their core. Google's redundancy-compacting routines are only invoked when requested, it seems, and even then, the variety of information in 200 hits may be small.
In short, it's an echo chamber. 200 or 4,000 or however many hits today aren't as impressive as the same number last year, etc. All the more so as web sites of all kinds put randomly chosen (even Googled!) words out there just to game Google.
While I agree this particular manifestation of the mistake is probably over-reported (because, haha, butts are funny) and fairly rare on the open web, I still get this shiner on page one of my search results:
Is the song Dueling Banjos considered blue grbutt?
Poor Bluegrass World. I'm pretty sure that site is legitimate, though I have no idea how they'd post an article in that state. Obligatory link to dueling banjos scene from Deliverance. I'm inclined to believe this is, in fact, still a problem. There are many, many examples besides "clbuttic" out there. Perhaps you've heard of the United States Consbreastution?
Of course, what we have here is failed obscenity filters implemented by (extremely) newbie developers with regular expressions. I could explain, but as they say, a picture is worth a thousand words, particularly when it's a picture of my very bestest friend, RegexBuddy:
Oh, great, an inexperienced developer had a problem, and thought they would use regular expressions. Now they have two problems. Well, technically through Google they now have many thousands of problems, but who's counting.
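For readers without the screenshot, here is a minimal Python sketch of the kind of substitution that produces "clbuttic". The word list and the replacements are assumptions inferred from the mangled examples quoted above, not a known implementation from any of those sites.

```python
import re

# Hypothetical word list, inferred from "clbuttic" and "Consbreastution".
REPLACEMENTS = {"ass": "butt", "tit": "breast"}

def naive_filter(text: str) -> str:
    """Replace every occurrence of a listed word, even inside other words."""
    for bad, clean in REPLACEMENTS.items():
        # No word boundaries: the pattern matches anywhere in the text.
        text = re.sub(bad, clean, text, flags=re.IGNORECASE)
    return text

print(naive_filter("Apple made the classic mistake"))  # Apple made the clbuttic mistake
print(naive_filter("blue grass"))                      # blue grbutt
print(naive_filter("United States Constitution"))      # United States Consbreastution
```

Nothing here needs a regular expression at all; `str.replace` would mangle the text just as thoroughly.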
I'm not sure regular expressions are to blame here. The replacement is so mind-bendingly naive that it might as well have been a simple Replace operation. We, being extra-smart-gets-things-done developers, would write a superior regular expression using the word boundary qualifier around the replacement, and use some capturing parens to handle both the singular and plural cases.
How about those Great Tits, eh?
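The "superior" version can be sketched the same way. This is an illustration only: the rule list is hypothetical, and the optional group stands in for the capturing parens that handle the plural.

```python
import re

# Hypothetical rules: \b anchors the match to whole words, and the
# optional capturing group records whether a plural ending was present.
RULES = [
    (re.compile(r"\bass(es)?\b", re.IGNORECASE), "butt"),
    (re.compile(r"\btit(s)?\b", re.IGNORECASE), "breast"),
]

def boundary_filter(text: str) -> str:
    """Replace listed words only when they stand alone as whole words."""
    for pattern, clean in RULES:
        # Re-attach a plural "s" whenever the plural group matched.
        text = pattern.sub(lambda m, c=clean: c + ("s" if m.group(1) else ""), text)
    return text

print(boundary_filter("the classic mistake"))              # the classic mistake
print(boundary_filter("How about those Great Tits, eh?"))  # How about those Great breasts, eh?
```

The word boundary rescues "classic", but the filter still cannot tell a bird from an insult.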
Proving, yet again, that bad ideas are just plain bad ideas, regardless of language or implementation choices. Obscenity filters are like blacklists; using one is tantamount to admitting failure before you've even started.
But it still happens all the time. One of the most famous incidents was when the Yahoo! email developers created the accidental non-word Medireview. They weren't trying to filter obscenities, but JavaScript webmail exploits.
In 2001 Yahoo! silently introduced an email filter which changed some strings in HTML emails considered to be dangerous. While it was intended to stop spreading JavaScript viruses, no attempts were made to limit these string replacements to script sections and attributes, out of fear this would leave some loophole open. Additionally, word boundaries were not respected in the replacement.
The list of replacements:
Javascript → java-script
Jscript → j-script
Vbscript → vb-script
Livescript → live-script
Eval → review
Mocha → espresso
Expression → statement
Some side-effects of this implementation:
medieval → medireview
evaluation → reviewuation
expressionist → statementist
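The side effects are easy to reproduce. The sketch below applies the quoted replacement list with no word boundaries; the use of `re.sub`, the case-insensitive matching, and the function name are assumptions for illustration, since the source only describes the behavior and not how Yahoo! implemented it.

```python
import re

# Replacement pairs taken from the list quoted above.
YAHOO_REPLACEMENTS = [
    ("javascript", "java-script"),
    ("jscript", "j-script"),
    ("vbscript", "vb-script"),
    ("livescript", "live-script"),
    ("eval", "review"),
    ("mocha", "espresso"),
    ("expression", "statement"),
]

def yahoo_style_filter(text: str) -> str:
    """Apply each replacement everywhere, ignoring word boundaries and context."""
    for dangerous, safe in YAHOO_REPLACEMENTS:
        text = re.sub(dangerous, safe, text, flags=re.IGNORECASE)
    return text

for word in ("medieval", "evaluation", "expressionist"):
    print(word, "->", yahoo_style_filter(word))
```

Because the substitution never checks whether it is inside a script section, ordinary prose is rewritten along with any actual exploit code.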
medireview.com is currently occupied by domain squatters. Perhaps that's a fitting end for this "company", though I perversely almost want the company to exist, as wholly formed from our imaginations, sort of like Jamcracker.
I can't help wondering just how freaked out the brass at Yahoo must have been about then-new JavaScript browser exploits to actually deploy such a brain-damaged "solution". To be fair, it was seven years ago, but still -- did it not occur to anyone that such broad replacement criteria might have some serious side-effects? Or that replacing one thing with another, when it comes to human beings and written language, is an activity that is fraught with peril even in the best possible circumstances?
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.
Good post, perhaps one day there'll be an effective white list plugin for the popular blogs complete with several gigabytes worth of dictionary?
Tom J Nowell on October 21, 2008 04:32 AM
I've seen this in a company intranet a while back, all kinds of hilarity ensued!
Shannon on October 21, 2008 04:33 AM
This is the problem with software-buttisted obscenity filters.
www.codingthewheel.com on October 21, 2008 04:36 AM
It's a very difficult problem, especially when you're working on sites aimed at young people where there is any degree of user supplied content. Kids love to swear and they love to break stuff, so the only real way of getting it to work is through full moderation, but I remember implementing a content filter for a message board on an educational site one time - we were smart enough to have tests to pick up most swearing that was standalone, then cram all the letters together to catch anything that was done by hyphenating words and so on, substitute basic 1337 characters for the relevant letters to be ready for that one - all in all it was a pretty good basic system for this task and it worked really well. Also the file that contained the dictionary of forbidden words was great - a list of all the variants of every potentially rude word you can think of is surprisingly funny.
The one thing we weren't expecting was for the users to self-censor, so the first time we saw a message where the little dears were telling one another to "f*** off" was quite a surprise to us.
Turns out if you need to swear in a content filtered environment, the easy way is to drop in an html entity code for one of your letters.
Breakfast on October 21, 2008 04:37 AM
When my kids were younger they used to play the flash games over on NickJr.com. My daughter, Kassandra, called me over saying that the game "wasn't working" - she was around 3, meaning that this could be anything from the focus being on the wrong app to she hit a new key sequence that would take me a few hours to undo. Thankfully, this one landed in the middle. The game would, you guessed it, not accept her name -- her name was a foul word. K-ass-andra.
Great implementation.
Al on October 21, 2008 04:37 AM
The funny thing is even Rock Band 2 has this problem. They call unsuitable Band names "not classy" and will refuse to list them on Xbox Live, but it's a mystery why the filter is triggered:
https://www.rockband.com/forums/showthread.php?t=90574
"I understand the certain 7 words used in a George Carlin bit can't be used as well as other profanity. That's fine. Now let's figure out some other things.
My wife made a band called "Stinkfoot" in honor of our cat who had an infected cut on his foot. (All fine now!) So, I'm guessing the word "Stink" is the equivalent of $%#& and *&^#$#$ now?"
Jeff Atwood on October 21, 2008 04:39 AM
I find that it doesn't matter b/c people kan't spel ineeway.
David on October 21, 2008 04:47 AM
Disney has tried several times to create a safe chat environment -- meaning one where people can't communicate anything negative. For example, they tried a system where only words from a whitelist are permitted. It never works, as people always find a way to route around the restrictions.
For a great post on the topic, see https://thefarmers.org/Habitat/2007/03/the_untold_history_of_toontown_1.html
Ilari Sani on October 21, 2008 04:49 AM
"Buttcopulation."
(Well, really not all that much better...)
anon on October 21, 2008 04:51 AM
Funny, just posted about this too. Was responding to an ITT, basically saying that automatic content filtering like that doesn't work. My favourite story was about Tyson Gay (you can guess where that's going):
https://www.guardian.co.uk/technology/blog/2008/jun/30/computerautocorrectssurname
Actually, one thought I had was that there's also a question of culture in this. I had to sit and think what was meant by 'clbuttic', 'cos here in the UK 'butt' is something you store rainwater in, part of a cigarette, or something football fans do with their heads...
Andy Burns on October 21, 2008 04:51 AM
Even word boundaries are extremely flawed. Try discussing the works of Philip K. Dick on the internet...
Lodewijk on October 21, 2008 04:53 AM
Ilari, that is easily the best link I have read all day, and it speaks volumes about human nature.
"I want to stick my long-necked Giraffe up your fluffy white bunny."
https://thefarmers.org/Habitat/2007/03/the_untold_history_of_toontown_1.html
Jeff Atwood on October 21, 2008 04:54 AM
Add Cumbria to your list of difficult UK locations; I had that pop up in one of my systems.
stjohnroe on October 21, 2008 05:04 AM
These filters also have a social engineering effect - my wife has stopped referring to our cats as pussies or pussy-cats in emails, because the emails are bounced. The effect being to kill off all but the obscene usage of the word and reinforce its offensiveness.
Joan Miro on October 21, 2008 05:11 AM
I wonder if non-American countries have this problem. This urge to limit certain types of speech utterances at any cost strikes me as a particularly American fantasy. One of the more amusing things about the internet as a whole is how it creates a huge tension between the stated American desire for free expression and the fact that once Americans see what "free expression" really means, they recoil in horror.
Shmork on October 21, 2008 05:16 AM
Did anyone find this post a bit ironic given Jeff's previous post on rolling his own HTML filter for Markdown?
https://www.codinghorror.com/blog/archives/001172.html
It seems when it comes to obscenity filtering, only fools would try and roll their own naive implementation, but when it comes to HTML filtering, well then, that's a different matter altogether.
David Avraamides on October 21, 2008 05:17 AM
The subject came up on a mailing list I'm on, and somebody mentioned a university site that tried to ban the word 'сialis', making it impossible to discuss socialism...
(The 'c' in the above example is actually a Cyrillic 's', because the site, with an impressive ironic flair, gave me a "Your comment could not be submitted due to questionable content" message the first time.)
jim on October 21, 2008 05:17 AM
OK, Jeff, you got me that time. Once more: сialis
jim on October 21, 2008 05:18 AM
Well, you know what the word is, anyway. I shall admit to having been defeated by /your/ filter ;)
jim on October 21, 2008 05:19 AM
Of course, as I say that about Americans, it's no doubt that China is probably better at censoring the internet. But at least they don't pretend to care about free speech as much. My understanding of the Chinese approach is to skip automated filters for the most part and instead use an army of human censors. If I were Disney I'd probably do it that way -- although I'd outsource my censors to India or China.
Shmork on October 21, 2008 05:22 AM
"The effect being to kill off all but the obscene usage of the word and reinforce its offensiveness."
On a related note, watch the Lemon Demon "Song of the Count" video -- https://www.youtube.com/watch?v=6AXPnH0C9UA -- and try to hear the original lyrics instead of what your brain fills in due to years upon years of media conditioning. (I'm not being a conspiracy theorist; you'll understand what I mean when you watch it.)
Eric Meyer on October 21, 2008 05:35 AM
> when it comes to HTML filtering, well then, that's a different matter altogether
they sort of are two totally different things -- HTML is a parseable computer language, not an infinitely malleable human language.
Compare:
1) How many different ways can you enter "penis"?
2) How many different ways can you enter a hyperlink?
Sure, #2 is large (har har) but #1 is pretty much INFINITE.
Jeff Atwood on October 21, 2008 05:35 AM
Don't forget foreign names too. F**k and s**t are common in some Japanese and German names.
monsur on October 21, 2008 05:36 AM
When I was a bit younger and definitely more innocent, I used to play the game Ultima V. It's still my all-time favorite game. I'm from Norway, and at the time my English wasn't perfect. It still isn't, but it's at least better. I know some curses now.
Anyway, in Ultima V you had actual conversations with in-game characters. Not the variant that's popular these days, when you get maybe three alternatives, and click on the one you'd like to say to the character. No, you actually had to type your question. Or, rather, you had to type at least the four first letters of your questions. The internal database had some keywords that triggered the appropriate response. So when I typed "job" the character would answer something like "I'm a blacksmith". And if I then typed "blacksmith" he would elaborate. Whenever I typed a dirty word, the character would say "With language like that, how did you become an avatar?".
Then one day I spoke with "Sven". I asked him about his "job". He told me he was a glassblower. I asked him about glass, and he told me some stuff about glass. I asked him about "blow" and he told me off for being naughty. And, being young and naive (and not English) I didn't understand why. Ah, good old days. These days if I ask someone about "blow" and "job" at least I know what I'm asking for :-)
Svein Bringsli on October 21, 2008 05:38 AM
Worse than useless, filters can encourage swearing. If you set an arbitrary line of what is unacceptable you are making everything which isn't caught implicitly acceptable. I was once told of a forum which was generally quite polite (apparently they do exist :-) and never had a problem with swearing until they implemented a filter. After that it was socially acceptable to say sh1t because "the system said you could".
Andrew Shirley on October 21, 2008 05:43 AM
Even more annoying are the ones that wipe out your entire sentence or post. I've been on forums before and had my paragraph changed to something like "_____ said a bad word"
That is frustrating, because you just lost everything you typed. Also, once people figure out you have a swear filter they start doing things to avoid it such as s<b></b>hit and sh1t. It really is quite pointless.
Billkamm on October 21, 2008 05:46 AM
On another related note I have been on other forums that when I click the "Post" button I get an error that says "Your post contains an inappropriate word" or something similar to that without telling you what the word is. On that particular forum I've had it block things like "stupid" but not "ass". It was very random.
Billkamm on October 21, 2008 05:48 AM
Word filters are bullshit.
Fuck censorship.
TraumaPony on October 21, 2008 05:51 AM
>> United States Consbreastitution?
You left the word 'tit' in as well as your replacement :)
Word filters are always going to fail, we can't account for every variation. T1Ts won't get picked up by a regular expression but any person will see the word right away.
I always try and steer clients towards a path of moderation and policy enforcement, but for those who must have word filters it's best that you make it clear right at the start that not everything will be caught... people will always find a way to be naughty:
"I want to stick my long-necked Giraffe up your fluffy white bunny."
perfect example :)
Aaron Bassett on October 21, 2008 05:59 AM
Interestingly enough, I just posted something yesterday on my blog where something slipped by me and I didn't notice until someone pointed out that I had unfortunately left out the first vowel in the word "count". I wouldn't have minded at least a little squiggly underline (maybe in yellow instead of red?) to suggest that in an article on HttpWebRequests and cookie-based authentication such a word was somewhat "suspect". After all, Google is my blogging platform, and while statistically they "know" that both the word I wanted and the word I actually typed are valid spellings of those two words, they could probably also intuit that one doesn't "belong" in the current article. No heavy-handed replacement strategy, just a nudge would have done it.
Jim on October 21, 2008 06:01 AM
Here's a modest (but doomed) suggestion -- how about we all stop freaking out about a word and worry about ideas instead? The idea that you can somehow "clean up" the internet by substituting one word for another that means the same thing is breathtakingly illogical.
A. L. Flanagan on October 21, 2008 06:13 AM
Great tits like coconuts.
Jamie on October 21, 2008 06:16 AM
<i>Word filters are bullshit.<br><br>Fuck censorship.</i><br><br>Those of us who browse the internet with our young children, and work for employers who do keyword-based traffic monitoring, might beg to differ. But those subtleties, and probably anything that doesn't directly relate to you, are most likely lost on you.
dan on October 21, 2008 06:18 AM
My favourite filter story is from the game Puzzle Pirates. The filter they used was to replace whole words with "piratey" equivalents, which worked really well - the decent players would, half the time, just say the piratey equivalents, and the kiddies would whinge that they were saying "barnacle". It ended up assisting roleplaying, and because it was optional you'd just turn it off if it became a problem.
But the best bit is the way it was implemented. See, it was a space-delimited text file, and on the day it was implemented, there was a space missing.
The filter *went the other way*, inserting swear words into regular speech.
Merus on October 21, 2008 06:21 AM
"Those of us who browse the internet with our young children, and work for employers who do keyword-based traffic monitoring, might beg to differ. But those subtleties, and probably anything that doesn't directly relate to you, are most likely lost on you."
So what if children see the word "fuck"? It's a word. Boo hoo.
TraumaPony on October 21, 2008 06:24 AM
The best bit on Bluegrass World is all the way at the bottom, in the related links where
"Bob Paisley and Charlie Cline have pbutted away"
HA... does no one look at that site after they write it? I guess I wouldn't be surprised.
Jim on October 21, 2008 06:37 AM
I once saw a videogame in a pub refuse to accept a player's name - he had entered "geoff". (i.e. obscenity = true if name =~ /off$/)
Years ago I was trying to register on MSN and I had to try many times before I figured out why it didn't accept my data. My surname is "Grootel". In the end I submitted my surname as "Gro otel". Fu ck that.
Joe on October 21, 2008 06:58 AM
Interesting timing.
This was just going around the news sites today:
https://arstechnica.com/news.ars/post/20081020-microsoft-gets-patent-for-real-time-f-bomb-bleeping.html
I'd love to see this work. (Or not.)
AndyL on October 21, 2008 07:03 AM
"Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. "
yeah, you're right. programming is too hard, let's go shopping!
David on October 21, 2008 07:23 AM
I disagree. Word filters, when done well, are hilarious.
The Washington Capitals message board is a great example. Every variant of every bad word has a different word that you get leading to extremely funny effects.
So quit be a your indoor sporting event whiny donkey from another mother about how most folks don't have a sense of humor when it comes to bad word filters.
Let's face some facts, though. The internet is not an appropriate place for kids. No amount of word filtering can change that.
Michael on October 21, 2008 07:39 AM
@AndyL
Yep looks like MS wants to deploy that crap on xbox live. The problem is that usually programs like that have to use neural networks and such for training and they are quite naive. So there will probably be more errors than anything else until the system learns and recognizes all the speech patterns of anyone who uses it.
Not an easy problem. Eventually users will just start speaking in incoherent accents just to game that system too.
"Thou shalt not censor the Internet." -God
Josh Stodola on October 21, 2008 07:44 AM
"Great breasts come in many races."
Amen to that.
SNF on October 21, 2008 07:52 AM
For those who laugh at this problem and write it off as web silliness, this exact sort of naive substitution has actually contributed to the result of a major U.S. Federal Court case on at least one occasion.
The case was 2005's Kitzmiller v. Dover, regarding teaching "intelligent design" (ID) in schools. A key element in the case was whether ID was actually just creationism in disguise. This is because the Supreme Court already ruled that creationism is off limits in public schools back in 1987.
The plaintiffs subpoenaed early drafts of the ID textbook that was going to be used in this particular school. They found this weird term "cdesign proponentsists" several times in the book. The editors had done a naive search and replace of "creationist" for "design proponent" (or some variation of that). This proved to the court that what they were talking about really was creationism.
Many details including scans of the pages here:
https://www2.ncseweb.org/wp/?p=80
I think this is a particular problem for the inhabitants of the British town, Scunthorpe.
Similarly, EVE-Online, an MMO based around flying "spaceships", has forums that won't let you use the term Cockpit :)
Russ C. on October 21, 2008 08:30 AM
So, then, what is your solution to sustain a child friendly environment on the internets, Jeff?
J. Stoever on October 21, 2008 08:48 AM
My 10 year old is a level 136 on ToonTown. He often plays two accounts simultaneously. A few weeks ago he was joking around and he got busted for using bad language between his own two accounts. He was banned for 72 hours and I had to call and talk to a Disney person promising that he'll never, ever do that again to get the account reinstated even after the 72-hour ban.
It was such a pain that I guaranteed him I'd personally delete the 136 level character if he gets busted again.
twmcneil on October 21, 2008 09:00 AM
"Those of us who browse the internet with our young children, and work for employers who do keyword-based traffic monitoring, might beg to differ. But those subtleties, and probably anything that doesn't directly relate to you, are most likely lost on you."
So what if children see the word "fuck"? It's a word. Boo hoo.
----
So that's an interesting point. I fall under the category of people who think that words shouldn't be considered "bad." If you don't like the meaning of the word, you may have other problems, but fundamentally, there should be no difference between the words "fuck" and "sex" (when used as a verb). Or "cunt", "vagina", and "pussy." Don't want your kids knowing about sex organs too early? Then none of the above should be acceptable. But picking out one and saying that it's bad is really just stupid, and even worse is substituting another slang word like "hoohaa."
But all this has a caveat. As humans, we're social creatures. We live, work, and interact with other people. As a father, despite how I feel about 'bad' words, I don't want my kid to be stigmatized or ostracized. I don't want his teachers to punish him just because a few tight-asses got together and decided that a certain combination of syllables shouldn't be spoken. As such, yeah, I don't want him seeing (or worse, hearing) the words because he may repeat them in the company of people who will think badly of him for it.
The joke of it all is that I realize that I'm perpetuating the system by feeling this way. We'll never get past this idiocy if we continue to use patterns which reinforce the behavior. But while I might be willing to sacrifice my own reputation for a cause like this, I won't sacrifice my child's. And so the cycle continues.
sanc on October 21, 2008 09:13 AM
@J. Stoever
The internet can not be made child friendly.
Software is no replacement for good parenting; you can not abdicate your responsibility to ensure their browsing is safe.
Make sure you can always supervise their usage!
Simon
Simon Johnson on October 21, 2008 09:15 AM
I'm glad <a href="https://thefarmers.org/Habitat/2007/03/the_untold_history_of_toontown_1.html">ToonTown</a>; has been brought up. People need to understand that example and the implications of it.
Unfortunately, the lesson at the end of the day seems to be that if you want to be protected from offense, don't communicate with strangers. And that pretty much takes the fun out of most of the internet.
Mark Tomczak on October 21, 2008 09:16 AM
I heard a Billy Bragg concert on the radio a long time ago, where he told the following joke between songs:
"Did you know that there are three English football teams that have swear words in their names? Arsenal, Scunthorpe, and... Manchester Fucking United."
Irving Reid on October 21, 2008 09:40 AM
TraumaPony : So what if children see the
TraumaPony : word "fuck"? It's a word. Boo hoo.
What if someone were to call your wife a slut and your mother a bitch, would you reply "It's a word. Boo hoo."?
Chris on October 21, 2008 09:46 AM
Having been to Scunthorpe I really don't mind if obscenity filters prevent me from searching the web for it. Call it natural selection, web style.
Seriously though, it's an absolute pain when dealing with over-zealous obscenity filters on message boards when you're using language which may actually have a clean, non-obscene usage.
As for a language's development isn't it in the preservation of decency which has actually led us to develop some of these fantastic colloquialisms in order to slip through the censorship net? Perhaps obscenity filters should actually become thesauri, intelligently providing synonyms and etymology of words rather than filtering them out? At least that way our kids actually get to learn how to use the words correctly...
Nikki on October 21, 2008 09:49 AM
"TraumaPony : So what if children see the
TraumaPony : word "fuck"? It's a word. Boo hoo.
What if someone were to call your wife a slut and your mother a bitch, would you reply "It's a word. Boo hoo."?"
Chris, you miss the point. The sentiment behind the word is what should offend, not the word itself. So I'd be offended in those cases, just as I'd be offended if they called her promiscuous.
sanc on October 21, 2008 09:53 AM
You're right. Programming is hard. Let's go shopping.
I used to be in the trivia chat rooms at yahoo. One of the favorite questions was "Who wrote Great Expectations?" Ans: "Charles Richardens" (that's how it would appear).
N. Velope on October 21, 2008 10:02 AM
People are too sensitive these days. We don't need no stinking obscenity filters.
Kris on October 21, 2008 10:29 AM
I worked for a company that created kiosks allowing customers access to what was formerly a database with only internal access. The data entry for new records was done by low wage, part time employees so the corporate office created a task of automatically purging "funny" records, including those that weren't directly obscene, but merely sound-alikes.
There was lots of self-congratulation until the angry phone call from Mr. Phuc Ho.
sburnap on October 21, 2008 10:53 AM
If you're going to use a straight word -> word filter, at least be intelligent about what you're replacing.
https://www.morewords.com/?*eval*
https://www.morewords.com/?*ass*
Also exciting is when the internet service itself gets tripped up in its own filters, as the old example from bash.org demonstrates:
<a href="https://bash.org/?178890">https://bash.org/?178890</a>;
*sigh* Not to complain too loudly, but...
The first time, I failed to notice the decent-sized red 'no HTML' sign. My mistake.
The SECOND time, I failed to enter the CAPTCHA word and then just re-entered it and re-submitted, resulting in the mangled mess above.
My fault for assuming that on a failed "Post" no changes would be made to my input, but...
Mark Tomczak on October 21, 2008 11:20 AM
"to sustain a child friendly environment on the internets, Jeff ?"
Can I get a filter that reverts any 'hilarious' misspelling memes, or 'hilarious' made-up-word memes back to their proper English forms?
It could have a blacklist that is updated once a week with the previous week's memes. I can say with a high degree of certainty that I would not lose any comedy by censoring "hilarious" memes that were positively identified an entire week ago.
AndyL on October 21, 2008 11:34 AM
sanc: "What if someone were to call your wife a slut and your mother a bitch, would you reply "It's a word. Boo hoo."?
You might be better off replac^Wcontent-filtering the offending words "wife" and "mother".
Vinzent Hoefler on October 21, 2008 12:12 PM
Google image search doesn't work well. It produces all kinds of bad and mixed up images but not much of what I am searching for.
Silvercode on October 21, 2008 12:46 PM
Context. Without understanding context, you can't understand what needs to be removed. I would think of all the places this might be solved, it would be speech recognition (context is very important there), but as we all know that's an area that still isn't solved either.
Erik Porter on October 21, 2008 12:51 PM
My girlfriend lives in ScuntHORPE and the Scunthorpe problem keeps catching me out.
I work in a school, we have filtered Internet access to prevent the kids from searching for porn, etc. So if I ask Google Maps to generate me a map taking me from Scunthorpe to somewhere... it's banned.
And yet I can type "holly" into Google Images and get pages of porn.
Web filters will ALWAYS fail because once you realise there's one in place, the game becomes "how can I defeat it?", and there's nothing better at brute-force testing than a class full of bored students with Internet access.
Really, if you think your web filter is any good, deploy it in a school and then look at the logs.
It'll take them about half an hour to find a web proxy - even if you blacklist the word "proxy". And once they're on a proxy everything is defeated.
Our web filter is notoriously bad, banning parts of session strings, or branding parts of Google as being "Blacklisted image filter" or "weighted phrase limit exceeded". One day the whole of Google was banned, searching for "The Simpsons" doesn't work, neither does "computer keyboard" or "The BBC". We have no idea why.
Unfortunately we buy our filtered Internet access from our local education authority, so have no control over it.
James on October 21, 2008 01:02 PM
Oh, I remember that way too well. There are actually much older systems that tried to filter obscene words. I remember myself how I wrote f*** and sh** so many years ago, as "fuck" and "shit" caused your post to never go anywhere. I fail to see the point of censoring people. What's so bad if I say "fuck"? Will the world stop spinning? I said the evil word, so what? If people want a kid-safe network, they should start by creating a .kid top-level domain, with very strict rules on who may obtain one (and regular checks on what people are doing with it). This part of the web can then be highly censored, so I guess this will have two effects again: Children will find out tricks to circumvent the filtering (as they always do ;-)) and teenagers will bother their parents to remove the Internet filtering as soon as possible, saying ".kid is only for babies. Forcing me to only surf on these pages is like forcing me to play in the sandbox. I'm getting too old for that"... and I remember way too well, if you only keep bothering parents long enough (and if you are really good at it), they will give in sooner or later :-P
I said it before and I like to repeat myself here:
Technology is not for solving society's problems. It's hardly ever the cause of such a problem, so why should it be the solution to one?
Mecki on October 21, 2008 01:13 PM
Just a few weeks ago, every single lower-case 't' on www.cisco.com was missing. I mean all of them - the entire rendered HTML had been post-processed with a rampant regex, so all the JavaScript and CSS was broken too. I took a screenshot if anyone is interested:
https://img396.imageshack.us/img396/66/ciscotfailwv8.jpg
We had a fun case when they tried to set up filters like that on our work email server some years back. A large amount of email started getting inadvertently blocked, which for the most part was from, or in reply to, anyone with "Analyst" in their job title (and therefore in their email signature).
Simon on October 21, 2008 01:22 PM
My favorite example was a filter to ensure proper references to the Queen of England. Applied to a story about honeybees.
"With its highly evolved social structure of tens of thousands of worker bees commanded by Queen Elizabeth, the honey bee genome could also improve the search for genes linked to social behavior.
...
Queen Elizabeth has 10 times the lifespan of workers, and lays up to 2,000 eggs a day."
Tom Finnigan on October 21, 2008 01:22 PM
I disagreeumptions have been proven right...
James D on October 21, 2008 01:43 PM
Yeah, it's really not a good idea to just do global replaces for words that you "think" are obscene. I can see a programmer doing it if they are working on a forum for Disney or something.
You could always just replace Carlin's Seven Dirty Words. Most of those don't appear as subsets of other words.
https://en.wikipedia.org/wiki/Seven_dirty_words
But then even if you filter, people will find ways around it.
fuck could be fcuk, kcuf, f0k, phawk or p#4K, chinga in Spanish, or even a fake word that means the same from a TV show like frell or frak.
It is entertaining, though, to see how far people will go in filtering kids' toys. I've tested a few of my kids' noisy spelling toys, and a few of them will say something like "oops!" if you spell a curse word.
Dave on October 21, 2008 02:43 PM
One could argue the 'superior' regex is still flawed:
\btit(s?)\b => breast$1
Following the capture-as-little-as-possible guideline.
Totally whining and nitpicking of course.
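To make the difference concrete, here is a minimal Python sketch comparing a plain substring replace with the word-boundary pattern quoted above. The function names are made up for illustration, and Python writes the back-reference as `\1` rather than `$1`; this is not the code any of the sites mentioned actually run.

```python
import re

def naive_filter(text: str) -> str:
    # Plain substring replacement: rewrites "tit" wherever it appears,
    # including inside innocent words.
    return text.replace("tit", "breast")

# The pattern quoted above: whole word only, optional plural captured in group 1.
BOUNDED = re.compile(r"\btit(s?)\b", re.IGNORECASE)

def bounded_filter(text: str) -> str:
    # \1 re-inserts the captured "s" (or nothing) after the replacement.
    return BOUNDED.sub(r"breast\1", text)

print(naive_filter("the United States Constitution"))
print(bounded_filter("the United States Constitution"))
print(bounded_filter("great tits like coconuts"))
```

The bounded version leaves "Constitution" alone, but it still rewrites the bird, which is the context problem raised elsewhere in this thread.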
1) How many different ways can you enter "penis"?
that's what she said
jamesL on October 21, 2008 03:35 PM
Google could do a better job if they wanted to. They are able to filter content quite effectively in various countries, at the behest of those countries many times...
This is such a silly game.
Pardeep on October 21, 2008 03:38 PM
Good post.
I implemented one of these systems. In Australia all data on a user actually belongs to the user and they can get a copy whenever they want. So I provided a warning mechanism that attempts to reduce swear words in notes on that customer while allowing the users to proceed if they are sure it is OK. Also, it doesn't stop the user saving, because the note may quote the customer swearing at us.
However, there were some interesting parts of this project:
1. It is actually quite a hard task to locate a good list of offensive words. Many words you think are OK are offensive to some people.
2. Context is key. There are religious words that are ok in some situations and offensive in others. Also, we have found that there are many people with last names such as "Cockburn" (pronounced Coburn), Dong and... well, you get the picture.
Several years ago, I wrote a porn filter for a web search engine. The goal was to detect porn, not substitute "naughty" words. Testing it was quite interesting. I was paid to browse porn for a while.
It was also interesting trying to get the word list internationalized. We contracted with free-lance translators and we'd get back non porn words. For example, if we asked for the translation of "tit" we'd get "breast" in the target language, which was not what we wanted.
The funny thing about the Rock Band band name filtering is that it's totally ineffective (as almost all automated censorship is). The Rock Band leaderboards at https://www.rockband.com/leaderboards include such gems as "The Real Qu33fs", "4-Way Beaver Bump", and "Black Super N Words" (among others) in the top 100 ranked bands alone.
Obscenity filters encourage obscenity.
Patrick on October 21, 2008 07:09 PM
Since when are the words "tits" and "breasts" considered offensive?
DMB on October 21, 2008 08:48 PM
I like a little profanity sometimes.. Screw the f#$%ing filter! :)
Tim on October 21, 2008 09:17 PM
I regularly circumvent obscenity filters in forums by inserting bogus tags in the middle of the word to create false word boundaries. The output is the same, but the filter can't catch it.
Except I normally do this because obscenities are fun.
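A rough Python sketch of why that trick works, and of the obvious countermeasure (filter what the reader will see, not the raw markup). The blocked word is a harmless placeholder, the function names are hypothetical, and the regex tag stripper is a crude stand-in for a real HTML parser.

```python
import re

# Crude tag stripper for illustration only; real markup needs a proper parser.
TAG = re.compile(r"<[^>]*>")

# Placeholder blocklist entry standing in for a real obscenity.
BLOCKED = re.compile(r"\bdarn\b", re.IGNORECASE)

def flagged_in_markup(markup: str) -> bool:
    # Checks the raw markup: an empty tag pair splits the word in two,
    # so the whole-word pattern never matches.
    return bool(BLOCKED.search(markup))

def flagged_as_rendered(markup: str) -> bool:
    # Removes the tags first, so the check sees roughly what a reader sees.
    return bool(BLOCKED.search(TAG.sub("", markup)))

post = "well da<b></b>rn it"
print(flagged_in_markup(post))
print(flagged_as_rendered(post))
```

This only closes one hole; entities, zero-width characters, and look-alike letters would each need their own handling.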
WurdBendur on October 21, 2008 10:13 PM
Well, what happens if you are writing an article in a dieting chat room about chicken tits. Oh... wait.
Chris Tillman on October 21, 2008 10:38 PM
IIRC in the new Battlestar Galactica they use the word frak instead of fuck. Like others have pointed out, how is using substitute words that mean the same thing any better than the original offending word? :?
Grom on October 21, 2008 11:45 PM
STOP ENGINEERING! This offends me .... :)
True story, last night, i was unable to enter the phrase "Engineering Deck" into the 'Location' field of my Xbox Live Gamertag. I was curious and tried to figure out what exactly was wrong with that phrase.
It turns out that "Enineering Deck" (note the missing 'g') is perfectly acceptable for Xbox Live. I can only assume i was subconsciously trying to spell out the technically correct female reproductive organ but spelling it incorrectly. I have no other ideas what could be offensive about any part of the word "Engineering".
Jebus, i am a fracking Software Venginaar myself! :p
I love obscenity filtering; it just reminds me of a little story. In the UK the filtering problem is sometimes known as the Scunthorpe Problem.
From Wikipedia (https://en.wikipedia.org/wiki/Scunthorpe_Problem):
"in 1996 in which America Online's dirty-word filter prevented residents from the town of Scunthorpe, North Lincolnshire, England from creating accounts with AOL, because the town's name contains the substring cunt.
Years later, Google's filters apparently made the same mistake, preventing residents from searching for local businesses that included Scunthorpe in their names."
albear on October 22, 2008 03:15 AM
Well, Google really don't seem to be the kings of information retrieval and natural language models either. I recently blogged a bit about Lisp and its current reception by up-and-coming programmers. Invariably, the word 'Sexp' occurred.
Now I have a content warning. Proudly wearing the badge of obscenity, -3 on Charisma, +2 on Vitality.
Aleks on October 22, 2008 04:01 AM
I had written a small program like this just for my own learning, and it worked nicely; I'd have to hunt for the program now. My implementation was very simple. It searched for the whole word, rather than matching in the middle or at the end of a word. One thing that I did was check for words with spaces between the letters: for example, someone can type a word's individual characters with spaces between them, so the program groups the individual characters and checks whether they form the whole word. It sounded like a good idea then; I just wanted to see how far I could succeed with it. It looked good. I stored the list of banned words in an encrypted text file.
Anand.V.V.N on October 22, 2008 07:12 AM
Sometimes there are political filters too. For a while earlier this year the town I live in -- Clintonville -- kept showing up as just "ville" in my comments on one site. They were apparently trying to ban political discussions from their non-political site by suppressing the candidates' names.
jt on October 22, 2008 07:38 AM
Belgium man! Obscenity filters are a load of zarking fardwarks, a complete joojooflop!
Douglas Adams on October 22, 2008 08:34 AM
Jeff, if your only worry about moving to Scunthorpe is problems you may have with email/web accounts then you need to do a little more research, lol....
Sorry, that should obviously be Svaginahorpe...
Carl on October 22, 2008 09:15 AM
I still remember the pet you could buy in World of Warcraft, but if you wanted to brag about it in the forums it came out as ****roach.
The big problem with language filters is that every time the filter gets activated, the result stands out from the context. Anyone else watch national TV and try to read lips to figure out what was said behind the bleep?
Mike on October 22, 2008 09:58 AM
i remember playing worms armageddon with my friends and finding they replaced the f-word with "love" on the in game chat. those characters at team17 just love to love with us!
cowgod on October 22, 2008 10:11 AM
The one that hits me all of the time is that the word "rape" is apparently filtered in the forums for World of Warcraft. Apparently, it was so commonly used that they needed to filter it.
But, they did the simple string-replace for it.
So, "heard it through the grapevine" hits the profanity filter, as do "drape" and many others. It's surprising how often that one hits.
Yet, "classic" is permitted. You get the feeling that once it was in production, the cost of fixing the regexp to be a little more intelligent is too great.
chapmand on October 22, 2008 10:27 AM
Why don't we all just become rational intelligent beings and rid ourselves of this ludicrous notion that some words are "evil"? That they should be taboo and discarded from everyday language or, even more ridiculous, reserved for special usage. Am I the only one that realizes there is no difference in a 3 year old saying butt vs. ass? If my 6 year old runs up to me and says "that kid is f&%$* dumb", I'd give him hell for judging another's intelligence without knowing the person. And then of course I'd have to explain that the word really doesn't mean anything negative but society has arbitrarily deemed it unacceptable and so you shouldn't say it in front of the majority of people until the populace at large stops being twits. Doing a little research you'll find that there are dozens of words that we (and our children) use today that many years ago were deemed highly offensive.
And yes, I disguised the profanity above to appease the censors and those who would be shocked, embarrassed or angered by such filth. :S
But that's not gonna happen, so I guess I'll write my own f'n profanity filter.
> Great tits like coconuts.
Fruit flies like a banana.
emperor nasi goreng on October 22, 2008 10:31 PM
The real problem is that, like CAPTCHA and DRM, content filtering cannot work.
To filter obscenity you need a strong AI that understands context and implied meaning and can filter new swearwords ("Belgium man! Obscenity filters are a load of zarking fardwarks, a complete joojooflop!") but will not filter apparent swearwords used out of context ("Great tits like coconuts").
The simpler solution that actually works is to monitor your kids' internet usage by being a good parent rather than relying on filtering software.
Jaster on October 23, 2008 01:46 AM
Jeff:
> The funny thing is even Rock Band 2 has this problem. They call unsuitable Band names "not classy" and will refuse to list them on Xbox Live, but it's a mystery why the filter is triggered:
Are you sure they don't call them "not clbutty"? ;-)
It would be fun to find a word filter that runs itself recursively, i.e. that runs over a text, then runs over its own output again until it can't find anything wrong with it anymore.
If you have such a word filter, try to find a word that will make it run in an infinite loop: i.e., the replacement word contains a sequence of letters which forms a profanity, then that gets replaced by something else that contains an illegal sequence, etc.
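For anyone who wants to try this, a minimal Python sketch of such a recursive filter. It reruns a substitution pass until the text stops changing, and gives up when the text returns to an earlier state or keeps growing. The rule tables and function names are invented for illustration.

```python
import re

def apply_once(text: str, rules: dict) -> str:
    # One pass over the text; inserted replacements are not re-scanned
    # within the same pass.
    if not rules:
        return text
    keys = sorted(rules, key=len, reverse=True)  # prefer longer matches
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)

def filter_until_stable(text: str, rules: dict, max_passes: int = 10) -> str:
    # Repeat the pass until a fixed point is reached.
    seen = {text}
    for _ in range(max_passes):
        filtered = apply_once(text, rules)
        if filtered == text:
            return text  # nothing left to replace
        if filtered in seen:
            raise RuntimeError(f"rules cycle back to {filtered!r}")
        seen.add(filtered)
        text = filtered
    raise RuntimeError("no fixed point within max_passes")

print(filter_until_stable("classic", {"ass": "butt"}))
```

A pair of rules that rewrite each other, such as `{"foo": "bar", "bar": "foo"}`, trips the cycle check, and a rule whose replacement contains its own trigger, such as `{"a": "aa"}`, runs into the pass limit instead.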
Jesper on October 23, 2008 06:11 AM
What was the first banned word? "Evangelical."
The founder of the Jesuits banned calling Protestants evangelicals.
You're welcome. Enjoy your mind.
"What if someone were to call your wife a slut and your mother a bitch, would you reply "It's a word. Boo hoo."?"
Well, pretty much. My mum IS a bitch.
TraumaPony on October 23, 2008 09:12 AM
This is definitely a challenging problem, but can be managed to some degree. The most interesting part about filtering is word boundaries. Spaces become meaningless on the web due to "t h i n g s" like "t h i s". Punctuation is also worthless because of "t.h.i.s.". You can even consider using punctuation for letters like ~~~~|~~~~his.
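As a rough illustration of the boundary problem, here is a small Python sketch that rejoins runs of single letters separated by spaces or punctuation before any word list is consulted. The separator set and the three-letter minimum are arbitrary assumptions, and this is not a description of how the commercial filter mentioned in this comment works.

```python
import re

# Characters treated as "noise" between letters (an assumption for this sketch).
SEPARATORS = re.compile(r"[\s.\-_*]+")

# Three or more single letters, each separated by one or more noise characters.
SPACED_RUN = re.compile(r"\b(?:[A-Za-z][\s.\-_*]+){2,}[A-Za-z]\b")

def collapse_spaced_letters(text: str) -> str:
    # Turns "t h i s" or "t.h.i.s" into "this" so a word-based filter can see it.
    return SPACED_RUN.sub(lambda m: SEPARATORS.sub("", m.group(0)), text)

print(collapse_spaced_letters("t h i n g s like t.h.i.s."))
```

The same pass also glues together legitimate initialisms such as "U.S.A.", so it trades one kind of false positive for another, which is the point the comment goes on to make.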
Since my business sells a profanity filter, we spend a lot of time thinking about these problems and the issue of word sense disambiguation. Our filter uses a number of tactics together to help reduce false positives because, after all, those are the real problem, not finding the profanity.
Of course, like most solutions, we have problems with "Dick Cheney" and "dicklips", but we are working on solutions for tackling them. The issue isn't that computers can't figure this stuff out. The issue is that they can't do it in real time easily. WSD (word-sense disambiguation) often requires a lot of processing and computational analysis to be accurate. Most sites don't want to force users to wait that long.
Feel free to check out our solution and send us comments.
Brian Pontarelli on October 23, 2008 09:12 AM
Look at my name and imagine how useful filtering has been for me...
Ben Sexton on October 23, 2008 04:14 PM
iTunes seems to have gone a bit bananas with their filtering: https://news.bbc.co.uk/1/hi/entertainment/7688705.stm
I play online poker and there's obscenity filtering on the message window, despite everyone present being, by definition, an adult. Baffles me every time I see someone trivially sidestepping it by putting a full stop in the middle of the word, or reversing the c and the k (for instance).
Dave on October 24, 2008 05:33 AM
How about an optional filter? Regardless of what you think about censorship, think about the end user of your application. What do *they* think about censorship?
Allow each user to choose censorship or not in the options pages. Everyone wins, including the person maintaining the filter (an optional filter means those who do want to use obscenities can freely use them, meaning far fewer clever workarounds).
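One way to read this suggestion, sketched in Python: store every comment unfiltered and apply the mask only when rendering for a viewer who has opted in. The `Viewer` class, the `wants_filter` flag, and the placeholder word list are all hypothetical.

```python
import re
from dataclasses import dataclass

# Placeholder word list; a real one would be site-specific.
MASKED = re.compile(r"\b(darn|heck)\b", re.IGNORECASE)

@dataclass
class Viewer:
    name: str
    wants_filter: bool = True  # hypothetical per-user preference

def render(comment: str, viewer: Viewer) -> str:
    # The stored comment is never modified; masking happens at display time.
    if not viewer.wants_filter:
        return comment
    return MASKED.sub(lambda m: "*" * len(m.group(0)), comment)

print(render("Well, heck.", Viewer("alice")))
print(render("Well, heck.", Viewer("bob", wants_filter=False)))
```

Because the original text is kept, a false positive only affects what opted-in viewers see and can be fixed later without losing anything.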
Krazy on October 24, 2008 07:23 AM
Amusing. My one-off Great Tits page (https://www.cafeaulait.org/greattits.html) is still one of the most heavily trafficked pages on my site off the main page, even though it's barely linked from anywhere.
Elliotte Rusty Harold on October 24, 2008 10:11 AM
There was (or is) a real case of some Russian "entrepreneurs" who decided to steal a snapshot of Russian Wikipedia, make their own site off it and serve ads. To remove references to Wikipedia, they presumably used a mass replace of "wiki" with "encyclo". There's only one letter for "V" in Russian, so "Wiki" and "Viki" are the same for directly transliterated words such as Wikipedia. This resulted in many interesting articles, the most notable of which is about the northern tribe of "Encyclongs", which became a small-scale meme.
Sergey Shelukhin on October 24, 2008 10:38 PM
Then there's the ever classic "dawizard" incident. (https://www.everything2.com/e2node/DaWizard)
Karl von L. on October 27, 2008 10:58 AM
There are some web services out there that filter profanity so you don't have to write your own. They seem to be pretty effective. I use WebPurify (www.webpurify.com) on my blog and it seems to work pretty well. It doesn't check for spelling mistakes though (-;
James Rosenstein on November 2, 2008 08:09 PM
I remember once trying to discuss the forward ptookus in a football (American) chat room... (btw, on the Battlestar Galactica example, that one actually dates back to the *original* series, and the new one retained it in tribute)
silverpie on December 16, 2008 01:56 PM
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.