Something we have had to deal with in managing educational software with a writing component: what is offensive to whom, in what context, and where is not universal at all.
One prime example: at one point a number of terms related to homosexuality had made it onto the list at the request of a larger district. These are also terms that are being reclaimed, and it was... a difficult problem to try to satisfy everyone, and it did upset other districts. I believe their patterns were all but removed eventually.
We have fought over the list of definitions, and every change provoked controversy. Our current solution is just that we mark items for teacher review but don't tell them why. We don't say they are offensive, we don't say what the problematic words are. We just say it might need review. That's worked pretty well so far.
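For anyone curious what "flag for review without saying why" might look like, here is a minimal sketch. It is not our actual code: `WATCH_LIST`, `ReviewFlag`, and `flag_for_review` are hypothetical names, and the two words are placeholders. The point is only that the result carries a yes/no signal and nothing about which word matched.

```python
import re
from dataclasses import dataclass

# Hypothetical watch list; placeholder words only.
WATCH_LIST = {"darn", "heck"}


@dataclass(frozen=True)
class ReviewFlag:
    submission_id: str
    needs_review: bool  # deliberately the only signal exposed to the teacher


def flag_for_review(submission_id, text):
    """Return a flag that says 'might need review' and nothing else."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    # No matched terms and no reason are stored on the result.
    return ReviewFlag(submission_id, not WATCH_LIST.isdisjoint(words))
```

The teacher sees `needs_review` and makes the call; the list itself never has to be defended word by word.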
All this is to say, policing speech is a problem best avoided.
Unfortunately, whether or not a term is really offensive is a combination of what it is, who said it, and when/where (at least in the common-sense definition). Unfortunately, because this is directly opposed to our (at least in the US, and in most countries rooted in liberalism) sense of fairness which says that rules should be applicable universally, across all people and in all contexts.
Which is to say… policing speech is a problem best avoided!
> Unfortunately, because this is directly opposed to our (at least in the US, and in most countries rooted in liberalism) sense of fairness which says that rules should be applicable universally, across all people and in all contexts.
The US is not even internally consistent about this - the legal definition of obscenity in the US is deferential to local community standards.
There’s internal inconsistency because different communities have different special issues that get us to violate our basic principles.
I worked in a completely different field and I had to give up on flagging any variations of "shit". Turns out working-class boomers will utter some form of it in every other sentence. Nothing harmful, just things like "my brother is full of horse shit" or "my job is bullshit".
“Shit” is pretty good because it is crass but not offensive (in the sense that it doesn’t target any particular group). And of course it describes a lot of what’s happening nowadays.
I'm post-boomer, but let me tell you, shit's still fucked.
Shitty to limit the use of shit to working-class boomers.
Typical cuss filter UX:
types something in live chat
some random word from the sentence gets censored out
"Why did this just get censored out?"
check urban dictionary
"Why?????"
Bonus points if it's regular ethnonyms that are classified as profanities, so people from that place have a hard time saying where they are from.
I have vivid memories of Digg back in the day censoring out absolutely baffling things in the middle of otherwise regular words.
Go gently caress yourself
That was Something Awful unless you were logged in
Wow, how embarrbutting
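That kind of output is what you get from plain substring replacement. A minimal sketch, assuming a hypothetical one-entry replacement table picked only to reproduce the joke above (I have no idea what any real site's filter looked like):

```python
import re

# Hypothetical replacement table, chosen only to reproduce the example above.
REPLACEMENTS = {"ass": "butt"}


def naive_censor(text):
    """Replace every occurrence, even inside longer words."""
    for bad, clean in REPLACEMENTS.items():
        text = text.replace(bad, clean)
    return text


def word_censor(text):
    """Replace whole-word matches only, using \\b word boundaries."""
    for bad, clean in REPLACEMENTS.items():
        pattern = rf"\b{re.escape(bad)}\b"
        text = re.sub(pattern, clean, text, flags=re.IGNORECASE)
    return text


print(naive_censor("Wow, how embarrassing"))  # Wow, how embarrbutting
print(word_censor("Wow, how embarrassing"))   # Wow, how embarrassing
```

Word boundaries fix this particular failure, but they do nothing for words that are innocent in one language and rude in another.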
Was really amused to see that a paper had English's most prominent profane word in its abstract on arXiv last month for the first time:
https://arxiv.org/search/?query=fuck&searchtype=all&source=h...
though somebody did slip in a use in a comment earlier.
It's certainly an interesting data set, though it has no concept of severity. As far as I can tell, "doodoo" is the same as some racial slurs: we're 100% certain they're bad words.
If I type the word 'doodoo' I'm pretty sure I'm not swearing... Most probably telling someone about baby sounds.
oh, it's about baby something but that sentence didn't end the way I thought it would
Baby shark, shitshitshit.
Baby shark, shitshitshit.
Baby shark!
https://www.usatoday.com/story/news/nation/2020/10/06/oklaho...
Nit: why is Portuguese named "European Portuguese"? If anything, the language spoken in Brazil should be called "American Portuguese".
I think in this case volume wins out in that over 90% of Portuguese speakers are Brazilian Portuguese speakers. If anything it may one day just become "Portuguese" and "European Portuguese".
At that time, we will have niche dialects "American English" and "British English". "English" will be identified with the variety spoken in India. Please kindly do the needful good sir.
We might as well say European Spanish when referring to the language that originated in the Iberian Peninsula.
I legit thought this said "... rating of success" meaning how likely the project was to be successful on some metric based on the profane words therein. I recall there was a study(?) akin to that for the Linux kernel, as a frame of reference
Was it something like FPC where the PC was per commit?
Could have been in a language-agnostic format (e.g. CSV)
I think the value add here is being a software package. The lists exist elsewhere and the package authors supplied sources. If you really need a combined list it should be trivial to generate it from the code.
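To illustrate "trivial to generate": a rough sketch in Python, assuming you have already loaded the word-to-rating mapping into a dict. The `RATINGS` values here are stand-ins (the "beaver" and "addict" numbers are the ones quoted in this thread, "asshat" is a guess), and `to_csv` is a made-up helper, not something the package ships.

```python
import csv
import io

# Stand-in data; in practice this would be loaded from the package's word list.
RATINGS = {"beaver": 0, "addict": 1, "asshat": 2}


def to_csv(ratings):
    """Serialize a word -> rating mapping as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "rating"])
    for word in sorted(ratings):
        writer.writerow([word, ratings[word]])
    return buffer.getvalue()


print(to_csv(RATINGS))
```

Going through the `csv` module rather than string concatenation means entries containing commas or quotes get escaped properly.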
"Beaver" unlikely to be used in profanity, eh?
Pretty unlikely. I can't remember the last time I heard anyone use it aside from talking about the actual animal it refers to.
It is a bit dated as slang in 2025.
I caught that too. Interesting set of examples. I can think of a profane use of beaver immediately, but not an unvalenced use of "asshat."
I think it was an intentional and good example to demonstrate that 0 means unlikely, not impossible.
I'm confused as to the purpose of all the zeros. Since this is far, far from a complete list of all English words, what's the difference between a word not being on the list vs a word being a zero?
I can kind of see "was this a word they considered and scored, vs. not considered?" when trying to assess whether the project is comprehensive, but from a programming standpoint, it just seems like it's going to have a lot of useless overhead, since by the time I'm looking up the word I don't care whether it's a zero or a miss.
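To make the distinction concrete, a small sketch with a hypothetical `RATINGS` dict (the "beaver" and "addict" values are the ones mentioned in this thread, "asshat" is a guess). A lookup that preserves "never scored" is only useful for auditing coverage; for an actual filter the zero and the miss collapse into the same answer.

```python
# Hypothetical ratings in the 0/1/2 style discussed here; not the real data.
RATINGS = {"beaver": 0, "addict": 1, "asshat": 2}


def rating(word):
    """Return the stored rating, or None if the word was never scored."""
    return RATINGS.get(word.lower())


def is_flagged(word, threshold=1):
    """Filtering view: a zero and a miss are treated identically."""
    return RATINGS.get(word.lower(), 0) >= threshold


print(rating("beaver"), rating("table"))          # 0 None
print(is_flagged("beaver"), is_flagged("table"))  # False False
```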
(I also find the scoring of "2" for many of the words to be weird, like "yank," "chug," "looser" etc. as they can all have perfectly normal meanings.)
Looked at the French words: most have a rating of 2 (mostly profane), even words that are not profane at all (e.g. envoyer => send) and words that have a profane second meaning but whose non-profane meaning is also in common use (e.g. morue => cod). Also, "retard" just means "delay"; I have never heard it used as profanity, maybe in Quebec? ("retard" in English would translate to "attardé" in French.)
The Dutch word 'kunt' (je kunt = you can) gets censored in WoW because of 'cunt'. That is, if you have the mature language filter on. I have it on because I have no interest in raging kids in said game, but I do want to read simple, common Dutch words. Annoys me to this day. CS gave the obvious answer (WONTFIX, with the obvious workaround of disabling the mature language filter altogether). It could be solved easily by looking at context instead of simple blacklisting. I connect from a Dutch IPv4. I sometimes speak Dutch. The same would be true for the other endpoint.
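A sketch of what "looking at context" could mean in its simplest form: a per-language allowlist that overrides the blocklist. Everything here is hypothetical (`BLOCKED`, `ALLOWED_BY_LANGUAGE`, `censor`), and how the language gets decided (IP, client locale, recent messages) is left out entirely. It says nothing about how the game's filter actually works.

```python
# Hypothetical: a variant spelling that ended up on an English-oriented blocklist.
BLOCKED = {"kunt"}

# Hypothetical per-language allowlist of ordinary words that collide with it.
ALLOWED_BY_LANGUAGE = {"nl": {"kunt"}}


def censor(message, language):
    """Mask blocked words unless the conversation language allows them."""
    allowed = ALLOWED_BY_LANGUAGE.get(language, set())
    out = []
    for word in message.split():
        key = word.strip(".,!?").lower()
        if key in BLOCKED and key not in allowed:
            out.append("*" * len(word))
        else:
            out.append(word)
    return " ".join(out)


print(censor("je kunt dat doen", "nl"))  # je kunt dat doen
print(censor("je kunt dat doen", "en"))  # je **** dat doen
```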
Do we know how exactly these certainty ratings are determined?
Edit: seems like it’s all arbitrary? E.g. in a PR[1] I saw random new words get added with no explanation of why a certain rating gets assigned.
[1]: https://github.com/words/cuss/pull/43/files (nsfw too).
These articles always remind me of some code for particle physics simulations. It was full of variables called anal_this and anal_that (because analysis).
Someone put a comment "stop calling variables ANAL!!! This is physics not an orgy!!!!"
I may have a copy on a disquette somewhere :)
I had a look at the French ones.
1/4 is normal everyday cussing
1/4 is cussing when the team is losing, but there are children around
1/4 are from the 17th century and I had a good laugh
1/4 are useful when driving
3 are actually bad (just an estimation :)).
The thing with French is that the cussing is quickly funny.
Good to know that "This package is safe."
When it comes to security, the only thing that beats warm fuzzy words are shiny security seals.
I am reminded of the late great Eudora, a Mac mail program. Late versions would flag ‘offensive’ terms in both outgoing and incoming messages. A hidden option setting would cause it to read aloud all flagged text.
Helpful tool for car makers.
Would probably have saved them from the Mitsubishi Pajero, Ford Pinto, Mazda Laputa
Downside is, it doesn’t analyze phonetics afaict. The Hebrew Volkswagen Beetle (Hipushit) would have passed as fine.
It seems to require specifying all spelling variants of a word https://github.com/words/cuss/blob/6bab3fef250481e34ba55bc40...
And then fails to do that for words that are not uncommonly written with a space https://github.com/words/cuss/blob/6bab3fef250481e34ba55bc40...
Making this a complete list will probably be a challenge when it needs to be a byte-for-byte match
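One common way around the byte-for-byte problem is to normalize both the list and the input before comparing, rather than enumerating every spelling. A minimal ASCII-only sketch, with a hypothetical one-entry `LISTED` set and made-up helper names; this is not how the package works, just an illustration of the alternative.

```python
import re

# Hypothetical single-entry list, stored in normalized form.
LISTED = {"bullshit"}


def normalize(token):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", token.casefold())


def find_listed(text):
    """Match single tokens, plus adjacent token pairs joined together."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        single = normalize(token)
        if single in LISTED:
            hits.append(single)
        elif i + 1 < len(tokens):
            pair = normalize(token + tokens[i + 1])
            if pair in LISTED:
                hits.append(pair)
    return hits


print(find_listed("my job is bull shit"))  # ['bullshit']
print(find_listed("Bull-shit!"))           # ['bullshit']
```

The trade-off is more false positives: joining adjacent tokens will occasionally glue two innocent words into a listed one.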
Where does the rating come from? Do you understand what all those words mean? It looks like you copied someone's rather subjective opinion. Because e.g. "bollo" and "caliente" aren't inherently profane in Spanish. Or do people think the hot water tap is leering at them? "Oye, tia, que caliente qu'et-ta!"
“Sureness” is not really a word; I had to read through to understand what they meant. “Certainty” or “confidence” would be clearer.
Sureness is most certainly a word! It has been used by writers of the stature of Emerson ("the law holds with equal sureness for all right action") and Edith Wharton ("The moment the reader loses faith in the author’s sureness of foot the chasm of improbability gapes.")
It used to mean "certainty", as when T. H. Howard writes, "Uncertainty about our religious condition is quite as unsatisfactory as any doubt about our most sacred domestic relationships. Sureness is vital to peace, and the truly sanctified soul will live in the region of certainty."
But in more modern usage the word has a connotation slightly different from what the author of this library intends. Its meaning is closer to "assuredness": confidence matched with ability. For example, "Proust had an incredible sureness of touch in shedding this prophetic ray on his characters." (again from Edith Wharton).
Point taken. It's not really a common word in modern usage, and probably not really the word the author wants here.
Based on a list of (in part) profane words which includes:
addict africa amateur american angry arab
I assume this is meant as criticism, but to be fair to the list, it classifies 5 out of these 6 as 0, which apparently means {Use as a profanity = unlikely, Use in clean text = likely}, and the 6th one (addict) is a 'maybe' on both scales which seems fair to me: wouldn't a respectful source speak of addiction, addictive substances, people who are addicted, etc.?
From just this short list and a handful of other words I looked at, they seem to have done a reasonable job of classifying them, even if I see other issues such as completeness and what even is the purpose
Maybe it's because of dialects of English I'm less familiar with, but I don't see how these (all classified as "1") are more likely to be profanity than "beaver" (classified as "0")
- abortion
- abuse
- addict
- addicts
Somewhat related: What is with the rampant cursing nowadays? In the US people are openly saying the f-word in professional settings, in public to strangers or acquaintances, in writing and video... seemingly everywhere, even in calm normal conversations.
I don't remember it being like this decades ago. Is it just me? I remember people used to curse only in private conversation, when angry, and never at the office in meetings and professional contexts.
Yeah, there's been a pretty big generational shift, I think mostly from GenZ. I'd posit that texting/social media may be a reason.
I first went to grad school ~20 years ago, and no one cursed in class, especially not the professors.
I recently went back to school and got another master's, and nearly all the mid-20-year-olds drop f-bombs in regular classroom talk to the professor constantly, like they don't even hear that they're doing it. Some professors don't mind, and even respond in kind (though much more self-consciously), some are clearly displeased, but the students barely notice.
Yes it's particularly prevalent in the under-30 crowd, and especially people under 25. I don't know about teens, not around them very much these days.
Don't get me wrong, I used that word plenty when I was that age, but only among peers in informal settings. Never at work or when talking to a person in a respected position.
It's not just you, and I would say that there seems to have been a general coarsening of society. The other day I saw someone with a bumper sticker saying "I pooped today", which I did find funny, but I reflected that it never would've been socially acceptable 30 years ago or so. People seem to have rejected the idea that some things are not acceptable to discuss or display openly. See for example "let your freak flag fly" and so on.
There are pros and cons to it, I suppose. I don't think it's bad for gay people to be out of the closet, for example. But I also find stuff like the rampant swearing* or "I pooped today" to be a bit troubling as I get older and think "man I wouldn't want my kids to learn it's ok to talk like that".
* not casting stones, I have a very strong swearing habit myself that I try to curb. It's hard.
Maybe because this is how people communicate?
I am French and when I speak English I use fuck when someone fucked up. I also say sex when people are, well, fucking.
The f*k, g**y, m***ly and others are childish.
But there are appropriate contexts.
I don't know your line of work, but presumably there are contexts where you wouldn't say "fuck," like to your CEO, or your top client, or your kid's teacher, or something, right?
So people just have different opinions on where the line is, and that line has shifted to include more contexts. That's simply what people are noticing.
Never realized the "chunky" in my chunky peanut butter was so profane. /s