Ideas to prevent spammers, make their work harder

Bug #1614403 reported by kaputtnik on 2016-08-18
Affects: Widelands Website | Status: Won't Fix | Importance: Low | Assigned to: kaputtnik

Bug Description

Even with the new captcha solution we got several spammers, apparently from India. I guess these are real humans, since some people do this work for a few pennies. We can do nothing to prevent such accounts from registering, but maybe we could make their work harder.

Some ideas:
- Allow a new topic only after a specific time has passed. If a user creates a new topic, he is not allowed to create another new topic for about (e.g.) 1 hour. This might have side effects, because the spammer then has time to find other places where he could write his spam (e.g. the wiki).

- Creating a "phrases blacklist": If a post/topic contains one of these phrases, prevent saving/posting. Examples of phrases from the latest spam floods: "baba ji", "Baba ji", "Tantrik" or just a regex catching (telephone) numbers like "+919829791419". The admins should then be informed when this happens.

Other ideas welcome :-)
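
A minimal sketch of the "phrases blacklist" idea, in plain Python so it can be tried outside of Django. The phrases come from the examples above; the phone number pattern (a '+' followed by 10 to 14 digits, optionally separated by spaces or dashes) and the function name are assumptions made for illustration, not code from the website.

```python
import re

# Phrases from the examples above; matching is done on lowercased text,
# so "Baba ji" and "baba ji" need only one entry.
BLACKLISTED_PHRASES = ["baba ji", "tantrik"]

# Assumed pattern for international numbers such as "+919829791419".
PHONE_RE = re.compile(r"\+\d(?:[ -]?\d){9,13}")


def find_spam_hits(text):
    """Return the blacklisted phrases and phone-like numbers found in text."""
    lowered = text.lower()
    hits = [phrase for phrase in BLACKLISTED_PHRASES if phrase in lowered]
    hits.extend(PHONE_RE.findall(text))
    return hits
```

A post would be held back (and the admins informed) whenever the returned list is not empty.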

Revision history for this message
SirVer (sirver) wrote :

I am convinced that the spammers in the forums are humans - we also had several attempts at spamming in the wiki, which we always reverted. I think addressing this by any means other than deleting their posts in a timely fashion is very hard...

My stance on this is that whatever is done to make life hard for spammers, it should not make life hard for regulars in the forum. It is a net loss if we fight spammers, but also alienate posters.

> Allow a new topic only after a specific time has passed.

This is something that I find would also affect regular posters. I do not like it for this reason.

> Creating a "phrases blacklist"

This is fragile and easily circumvented. Spammers will just misspell words or leave out certain characters - they do it for spam in mails already. Also a SPAM classifier could be used on the backend - but they do have false positives and it is very annoying for regulars if you are flagged as SPAM without actually being spam.

A combination of this could be used, something like: "Your post was received, but our classifier is 85% certain that it contains SPAM. An admin will review your post in a timely manner and allow or reject it." And if the classifier does not complain, it will just go through.

Relevant SO question. I like the approach: basically it uses the akismet web service. So each comment would get sent to akismet (if it is reachable) and if akismet says SPAM, we hold it back from publishing until review. If akismet greenlights it, we let it go through.

http://stackoverflow.com/questions/3641042/automatic-spam-filtering-or-flagging-for-django-or-python

Revision history for this message
GunChleoc (gunchleoc) wrote :

Akismet sounds like a plan.

If we are still having problems then, maybe we could also have an increased flood limit for new users only - have a special "New User" usergroup with automatic promotion to "Registered User" after x posts.
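
A rough sketch of such a flood limit for new users only. The threshold of 5 posts for promotion and the 1 hour cooldown are hypothetical values (the thread leaves "x" open), and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

NEW_USER_POST_THRESHOLD = 5             # hypothetical value for "x posts"
NEW_USER_COOLDOWN = timedelta(hours=1)  # hypothetical flood limit


def may_create_topic(post_count, last_topic_at, now):
    """Return True if the user may open a new topic right now."""
    if post_count >= NEW_USER_POST_THRESHOLD:
        return True  # promoted to "Registered User": no extra limit
    if last_topic_at is None:
        return True  # the first topic is always allowed
    return now - last_topic_at >= NEW_USER_COOLDOWN
```

Note that this still lets every new account post at least once, which is the objection raised further down in the thread.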

Revision history for this message
kaputtnik (franku) wrote :

I have trouble with third-party services... we have to notify all users about it and explain which data is sent (besides each comment: IP address, User-Agent).

We need an akismet key for this, available on https://wordpress.com/ or https://akismet.com/plans/ (if none of us has one already)

Link to python api: http://www.voidspace.org.uk/python/akismet_python.html

> If we are still having problems then, maybe we could also have an increased flood limit for new users only - have a special "New User" usergroup with automatic promotion to "Registered User" after x posts.

That sounds good to me.

Revision history for this message
SirVer (sirver) wrote :

> I have trouble with third-party services... we have to notify all users about it and explain which data is sent (besides each comment: IP address, User-Agent).

According to the docs, the only required thing we have to send akismet is the comment - which is public information anyways, so no PII (personally identifiable information). It says akismet does a better job at detecting spam if you also provide email, IP and user agent - but I think for our spammers, the SPAM is obvious enough in the comment. So I would start there.

The free plan looks reasonable to me for starters. Feel free to grab a key - it will end up in our local_settings.py eventually and all admins can find it there if needed.

> If we are still having problems then, maybe we could also have an increased flood limit for new users only - have a special "New User" usergroup with automatic promotion to "Registered User" after x posts.

I do not like this because it means that every spammer can still spam at least one message - for us admins there is no difference between 1 or 10 messages and for users it is only slightly less annoying. For real new users that have a lot to say though, they might run into the rate limiting which is awkward.

Revision history for this message
kaputtnik (franku) wrote :

According to https://akismet.com/development/api/#comment-check

the IP and the user agent are always required besides the comment.

Regardless of this and my personal aversion to services which I do not control, informing the users is an act of truthfulness and respect towards them, imho.
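
To make concrete which data would leave the server, here is a small sketch of how a comment-check request could be put together. The endpoint layout and the parameter names (blog, user_ip, user_agent, comment_content) follow my reading of the Akismet API page linked above and should be checked against it; the function names are hypothetical and nothing is sent over the network here.

```python
from urllib.parse import urlencode


def build_comment_check(api_key, site_url, user_ip, user_agent, content):
    """Return (url, body) for an Akismet comment-check POST request."""
    url = "https://%s.rest.akismet.com/1.1/comment-check" % api_key
    body = urlencode({
        "blog": site_url,          # the site the comment was posted on
        "user_ip": user_ip,        # personal data: must be disclosed to users
        "user_agent": user_agent,  # personal data: must be disclosed to users
        "comment_content": content,
    })
    return url, body


def is_spam(response_text):
    """The service answers with the plain text 'true' (spam) or 'false'."""
    return response_text.strip() == "true"
```

The body would be sent as a form-encoded POST; if the service is unreachable, the post would simply go through unchecked.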

Revision history for this message
kaputtnik (franku) wrote :

What about using akismet only for registered users when they write their first x posts/topics?

We could then add some text to the templates, like: "This is your first topic/post on widelands.org. To prevent spamming we use the akismet service [link]. Therefore some information (your comment, IP address and your browser's user agent) is sent to this service. After x verified posts/topics the akismet service isn't used anymore when you post content."

I believe most of the currently registered users are safe, so we do not need to inform them about this service, unless they haven't yet written x posts/topics.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Sounds good to me.

Revision history for this message
SirVer (sirver) wrote :

> the IP and the user agent are always required besides the comment.

You are right, I was misunderstanding the python library I was reading. That is really unfortunate - because I agree that sending PII to other services feels iffy.

> What about using akismet only for registered users when they write their first x posts/topics?

This can easily be circumvented - write some spam that looks fine to the classifier 5 times, then spam the forums.

How about only checking for SPAM in posts and edits that contain links? After all, this is what spammers are after - a backlink to their server.

Revision history for this message
kaputtnik (franku) wrote :

> How about only checking for SPAM in posts and edits that contain links? After all, this is what spammers are after - a backlink to their server.

In case of the "Babaji love problem solution" spammers this will not work, because they are only submitting phone numbers, no links.

By the way, thanks for deleting these users/posts and keeping our forum clean :-)

Revision history for this message
kaputtnik (franku) wrote :

I can't believe that there is no ready-made solution out there... all I found use the akismet service.

Wouldn't it be possible to make a pure python/django spam filter? Analyze a comment/post/text with different analyzers and give each of them a weight...

1. Analyze newlines: How many newlines are in this comment regarding the number of words in each line?
2. Analyze the text against some phrases: How often are specific phrases (love, purchase, buy, ...) used in the text?
3. Analyze against numbers: Is there maybe a phone number in this text and how often does it appear?
4. Analyze if markup is intentionally used (e.g. double space at end of line)
5. Analyze external links: Maybe using a whitelist containing known image hosters or launchpad.*.
6. ...

Each analyzer returns a 'weight', a number showing how much the analyzer applies. If the overall weight of all analyzers is above a specific number, the comment is maybe spam.

Yes, I know, it's always easier said than done... just an idea...
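
The weighted analyzer idea could look roughly like the sketch below. Only two of the listed analyzers are shown; the word list, the factors and the threshold are invented for illustration and would need tuning against real posts.

```python
import re

SPAM_WORDS = {"love", "purchase", "buy"}  # illustrative word list
PHONE_RE = re.compile(r"\+?\d{10,13}")    # assumed phone number pattern


def phrase_weight(text):
    """Share of words that are on the spam word list (0.0 to 1.0)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in SPAM_WORDS for word in words) / len(words)


def phone_weight(text):
    """1.0 if something that looks like a phone number appears, else 0.0."""
    return 1.0 if PHONE_RE.search(text) else 0.0


# Each analyzer is paired with a factor saying how much it counts.
ANALYZERS = [(phrase_weight, 2.0), (phone_weight, 1.0)]
SPAM_THRESHOLD = 1.0  # hypothetical cut-off


def spam_score(text):
    """Sum of all analyzer results, each multiplied by its factor."""
    return sum(factor * analyzer(text) for analyzer, factor in ANALYZERS)


def looks_like_spam(text):
    return spam_score(text) >= SPAM_THRESHOLD
```

Further analyzers (newlines per word, markup usage, link whitelist) would be added as more entries in ANALYZERS.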

Revision history for this message
GunChleoc (gunchleoc) wrote :

I use this on my phpBB forum:

http://www.stopforumspam.com/

It uses the username, e-mail and IP address. I guess that all services need this kind of information in order to work at all.

Another limit we might have is to put all posts that are written within the first 2 hours after account activation into a moderation queue.

Revision history for this message
SirVer (sirver) wrote :

The problem with heuristics is that they are simple to circumvent - all it requires is a little trial and error. We could do something more clever - for example using a classifier/machine learning to filter SPAM, but fact of the matter is: SPAM looks like content and the only reliable way of recognizing it is to see it happen multiple times, across sites.

That is why these services require IPs - they learn from multiple sites and recognize a spammer IP for a while. And of course they also do content analysis.

Revision history for this message
kaputtnik (franku) wrote :

Fighting against spammers is an everlasting task, imho. So a solution that works universally does not exist.

But we should consider what our spam problems (currently) are: Most annoying are the "Babaji vashikaran problem solution" spams. Others are "package movers". The first ones could easily be fixed with some heuristics, imho. I think the spammers are not really smart. The "package movers" spams are a bit better, but they could also be easily prevented with some heuristics, imho.

Related to the akismet service:
The user's IP and user agent are always sent when surfing the internet. So in general they aren't problematic. E.g. if one surfs to akismet.com, the information is stored (or not) in their database. The problem I have with this service is that I couldn't find a legal notice, or a similar notice, which describes how this information is used besides checking for spam.

The service GunChleoc proposed in #11 is a bit better, because we could check the information ourselves by comparing the user IP (and other information) against their database. But I 'believe' it does not solve our particular problems. But we may give it a try...

Revision history for this message
GunChleoc (gunchleoc) wrote :

While we don't have a solution for this, how about removing the forum news from the #widelands-de channel? We can't easily delete the spam there.

Revision history for this message
SirVer (sirver) wrote :

Janus controls this bot. We should ask him in IRC.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Looks like he needs it.

How about adding a math captcha?

https://github.com/alsoicode/django-simple-math-captcha

We could try it out before we do a lot of programming ourselves.

Revision history for this message
SirVer (sirver) wrote :

A captcha will not change anything besides annoying the other users on the forum: the posters are not bots, they are people. We have to completely prevent them from posting somehow.

Revision history for this message
kaputtnik (franku) wrote :

I guess that (in case of 'vashikaran baba ji') real humans are only 'used' to get an account and after that a bot is creating new topics. Reading this topic leads me to assume this: https://www.themightyquest.com/en/forums/website-and-social-media-suggestions/topics/achancetostopvashikaran

The solution there is to rename the links so that they do not contain e.g. 'forum'. But I have not looked into our code to see whether this would be easy for us, and the topic over there also shows no success message.

Anyway: Since all these topics have strings like '+91' and a 10-digit number in their topic names, as well as lots of 'love', 'vashikaran' and 'problem' in their topic text, it should be easy to prevent them.

As an alternative to making a spam filter ourselves, I am in favor of using the akismet service as described in #6 (using the service for the first x posts of a new user).

SirVer said:
> This can easily be circumvented - write some spam that looks fine to the classifier 5 times, then spam the forums.

I don't think this is a valid reason not to use this suggestion. At least we are able to tell spam from normal posts. So when the first topics/posts are created, we could react to them and delete the user. And I don't think that the 'vashikaran' spammer is smart enough to check this.

As said, fighting against spammers is an everlasting task, I think. Completely preventing them without changing the workflow for normal users isn't possible imho.

So i suggest trying to make a spam filter by ourselves first. This should solve at least the 'vashikaran' spams (the most annoying ones these days). If that doesn't work, we could look for other solutions.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Renaming will only work temporarily and break all the links from Launchpad bugs to forum topics.

Let's start with a keyword filter and see how it goes.

Revision history for this message
SirVer (sirver) wrote :

I thought about the following schema:

- Add a 'probability for SPAM' value to each post. This is 0 for all existing posts (no spam).
- Add a 'notoriety' value to each user's profile. Initially this is set to 1 - we do not trust the new users.
- Each new post is classified before writing to the database. We can use akismet or similar or try our own heuristics first, or do a combination of those. This classification gives a value for the probability for the post in [0...1]. This value gets multiplied by the notoriety of the user. If we decide the post is not SPAM, we decrease the notoriety of the user by a bit so that it reaches 0 after ~50 non-spam posts. This guarantees that veteran users will not be misclassified as spammers - but it does not protect us from account stealing.
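
A small sketch of this schema. The step of 1/50 follows from "reaches 0 after ~50 non-spam posts"; the decision threshold of 0.5 and the function name are assumptions, since the schema above does not fix them.

```python
NOTORIETY_STEP = 1.0 / 50  # notoriety reaches 0 after ~50 non-spam posts
SPAM_THRESHOLD = 0.5       # hypothetical decision threshold


def classify_post(spam_probability, notoriety):
    """Return (is_spam, new_notoriety) for one new post.

    spam_probability is the classifier output in [0, 1];
    notoriety is 1.0 for a new user and 0.0 for a trusted veteran.
    """
    score = spam_probability * notoriety
    if score >= SPAM_THRESHOLD:
        return True, notoriety  # no trust gained from a spam post
    return False, max(0.0, notoriety - NOTORIETY_STEP)
```

Existing users would start with a notoriety of 0, so none of their posts can be flagged, whatever the classifier says.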

Revision history for this message
kaputtnik (franku) wrote :

Sounds good to me.

I am thinking about what to do if a post is classified as spam (or potentially classified as spam):

Profile (only for new users or where the 'notoriety' is not 0):
1. Deactivate the user account and e-mail him about the deactivation. Then he can't do anything anymore, not even log in. This might result in many new user accounts being created. A check for already used e-mail addresses should make creating accounts much harder.

Forum:
1. The post/topic should not be visible to normal users, but shown to admins (maybe with another background). So one of us could decide if it is really spam.

Revision history for this message
GunChleoc (gunchleoc) wrote :

I think as a first step, hiding the posts only would be good. This way we can make sure that we don't accidentally remove non-spam.

We could display to the user on-screen that the post needs to be moderated rather than deactivating and sending an e-mail. I would like to get some form of notification though, so I can go in and publish or delete posts as appropriate.

kaputtnik (franku) on 2016-10-02
Changed in widelands-website:
assignee: nobody → kaputtnik (franku)
status: New → Confirmed
importance: Undecided → High
status: Confirmed → In Progress
Revision history for this message
kaputtnik (franku) wrote :

Today I found a spammer whose username I had deleted some days ago. By chance I also found a spammer who added posts to existing topics, without creating new topics...

The attached branch now has the following properties:

- Added a hardcoded keyword filter for testing purposes
- If someone writes a post containing 'vashikaran' AND 'baba' (case insensitive), this post is hidden from normal users. If this is the first post in a topic, the topic is also hidden.
- Logged in superusers can see all hidden posts/topics in the forum lists (no different coloring right now)
- If potential spam is detected by the filter, the user gets redirected to a page saying that his post is hidden and has to be moderated. Please review: http://bazaar.launchpad.net/~widelands-dev/widelands-website/anti_spam/view/head:/templates/pybb/pybb_moderate_info.html
- The akismet python api is added, but not used right now

If you want to test this with an existing database (e.g. dev.db), you should back it up and run './manage.py migrate' against it, create a non superuser account (e.g. 'spammer') and don't forget to add this user to wlprofile and Gggz. Then write some spam with this user containing both keywords (e.g. 'vasHikaran' and 'bAbA')

Revision history for this message
SirVer (sirver) wrote :

Is this ready for review, kaputtnik? I suggest launching this onto the site asap and iterate there.

Revision history for this message
kaputtnik (franku) wrote :

There is one missing feature in this branch: If spam is added to an existing topic containing some posts, the last non-hidden post doesn't show up in the 'Latest posts' box. Since most spam these days consists of new topics, this shouldn't be a problem right now.

Some of today's spam wouldn't be caught, because no keyword is used.

I will run a local test against a copy of the MySQL alpha db and propose the branch for merging this evening.

> and iterate there.
Sorry, I don't get the meaning of 'iterate' (German 'wiederholen', i.e. 'to repeat') here :-S

Revision history for this message
GunChleoc (gunchleoc) wrote :

Iterate means keep working on it to make it better. Informatikerspeak :P

Revision history for this message
kaputtnik (franku) wrote :

> Informatikerspeak :P

Thanks :-)

Revision history for this message
GunChleoc (gunchleoc) wrote :

I just noticed that some of the user names keep repeating, maybe we could prevent those from registering at all. For example, I have recently deleted "vijaykumar9929" 4 times.

vijaykumar9929
loveguruji312
vijaykumar9929
vijaykumar9929
sharmarahul5554
vijaykumar9929
karamd
karamd12

Revision history for this message
kaputtnik (franku) wrote :

I have also deleted vijaykumar9929 several times and thought the same. But then he may register with vijaykumar99[30..50], so this doesn't solve anything.

I guess the 'human' way of registering is:
1. One tries 'kaputtnik' as username
2. If this is already used, one will try 'kaputtnik[digit]'

If this is true, we could add a check like:

If 'username' without numbering already exists
   allow 'username+numbers'
else
   disallow 'username+numbers'

But I think this is too restrictive.
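
For illustration, the rule above written out as a small Python function. The name and the regex are hypothetical; a real check would query the user table instead of a set and probably ignore case.

```python
import re

# Splits a name like "kaputtnik12" into ("kaputtnik", "12").
TRAILING_DIGITS = re.compile(r"^(.*?)(\d+)$")


def username_allowed(candidate, existing_usernames):
    """Allow 'name+digits' only if plain 'name' is already registered."""
    match = TRAILING_DIGITS.match(candidate)
    if match is None:
        return True  # no trailing digits: nothing to check
    base = match.group(1)
    return base in existing_usernames
```

With this rule 'vijaykumar9929' would be rejected as long as nobody has registered plain 'vijaykumar', which also shows why it is restrictive: any honest user who likes digits in his name is rejected too.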

Revision history for this message
kaputtnik (franku) wrote :

The list of my deleted users:

<email address hidden>
<email address hidden>
molana
vijaykumar9929
vijaykumar9929
lucky
vijaykumar9929
fhfghfgh.vfhgfhg
jasonsmith
vijaykumar9929

It would also be nice to have all admins informed if a user was deleted.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Yes, I also think that that would be too restrictive. Best have a banned keyword list like for topic titles and forum posts.

Revision history for this message
kaputtnik (franku) wrote :

Just a note about the merged and deployed branch:

Added to local_settings.py:
# Anti spam keywords:
ANTI_SPAM_BODY = ['vashikaran', ' baba', 'molvi ']
ANTI_SPAM_TOPIC = [' baba ', ' ji', 'molvi']

I decided to filter on the appearance of ANY keyword in the post's text (before it was "all keywords have to be in the text").

To unhide a post you have to go to the admin page pybb -> posts, search for the post and then you will find the hidden property under "Additional options(show)".
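
The difference between the ANY and the ALL variant, sketched with the ANTI_SPAM_BODY list from above. The helper names are made up, and the lowercasing is an assumption based on the "case insensitive" note for the test branch. Spaces inside a keyword are significant: ' baba' only matches when the word starts after a space.

```python
ANTI_SPAM_BODY = ['vashikaran', ' baba', 'molvi ']


def matches_any(text, keywords):
    """True if at least one keyword occurs in the text (current behavior)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def matches_all(text, keywords):
    """True only if every keyword occurs in the text (earlier behavior)."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in keywords)
```

A text like "Famous BABA here" is caught by matches_any but not by matches_all, while "Alibaba" is not caught at all because of the leading space in ' baba'.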

Revision history for this message
SirVer (sirver) wrote :

There has been another round of spam attacks tonight and that shows that we do not hide the posts in all places yet:

Forum listings: https://wl.widelands.org/forum/forum/2/ (screenshot attached)
RSS feeds: I checked "latest posts on all forums" and "latest topics on all forums". Both contain the SPAM. I think that will be the same for the atom feeds.

And one minor bug: the latest posts section on the homepage should show a constant number - now it only shows 2 since the others in its selection are hidden.
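
If the cause is indeed that the newest posts are selected first and the hidden ones are dropped afterwards, the fix is to filter before cutting the list to its final length. A plain-Python sketch of that order of operations, with dictionaries standing in for the real post objects (the 'hidden' key and the function name are assumptions):

```python
def latest_visible_posts(posts, count=5):
    """Return the `count` newest posts that are not hidden.

    `posts` is assumed to be ordered newest first.
    """
    # Filter first, then slice: slicing first would return fewer than
    # `count` posts whenever some of the newest ones are hidden.
    visible = [post for post in posts if not post["hidden"]]
    return visible[:count]
```

The same filtered list could feed the RSS/atom feeds so that hidden posts do not leak out there either.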

Revision history for this message
kaputtnik (franku) wrote :

Forum: Your screenshot shows that you are logged in. Since you're a superuser it should be shown there :-)

Rss-Feeds: Yes, I thought of them but haven't spent time on cleaning them up yet. Can anyone say how much they are used? I tried to fetch them with my new smartphone, but it didn't work (the phone says the RSS-Feeds aren't valid?)

Latest Posts: This is the next thing I will look into

As I remember, vijaykumar writes many more than three topics when he is online. This time he wrote only three. GunChleoc: could you confirm this?

Revision history for this message
GunChleoc (gunchleoc) wrote :

Yes, as I remember he tends to write multiple posts.

RSS-feeds: I expect that this is what janus' bot is using to autopost to the #widelands-de IRC channel, so cleaning them up is important.

Revision history for this message
kaputtnik (franku) wrote :

Thanks GunChleoc.

Is there a need to show the hidden posts/topics to superusers at all in the forum overviews? The e-mail about a new topic contains a link to the topic, so hiding the topic from normal users and showing the topic itself to superusers is maybe enough?

This would make some things easier :-)

The newly attached branch fixes:

- Latest posts box
- No hidden posts in feeds anymore

Revision history for this message
kaputtnik (franku) wrote :

Strange... I just found a non-hidden post from our superspammer that he wrote just a minute ago. Testing this post locally, it gets hidden. Hm...

Revision history for this message
GunChleoc (gunchleoc) wrote :

Maybe showing the hidden property in the table for the admin posts overview would be the easiest option. Those table columns can be sorted by 1 mouse click, which would put all hidden posts at the top - maybe 2 mouse clicks depending on sort order.

Revision history for this message
kaputtnik (franku) wrote :

Good idea :-)

Just a note: During deletion of the user this morning the database got into an inconsistent state and threw errors when accessing posts/feeds. I guess this was caused by deleting a topic while a post was being saved at the same time. I wonder why this didn't happen earlier... The only way to solve this was to delete the post directly in the database on the server, because the saved post didn't show up on the admin page.

I would suggest setting a user's status to inactive instead of deleting him, at least while he is still online. This would also prevent registering with the same username.

While investigating the database I found several suspicious usernames, see https://wl.widelands.org/admin/auth/user/?q=babaji

Some haven't activated their profile (status 'inactive'), others have never been online (no 'last login'). We should consider deactivating some of them, e.g. 'vashikaran' or 'molviji396'.

Revision history for this message
GunChleoc (gunchleoc) wrote :

I'm thinking that sorting users by last login date and adding
deactivating to the dropdown would speed things up.

Post number in the table would also be nice - accounts with 0 posts and
last login date > 5 years or so could be deleted.

Revision history for this message
kaputtnik (franku) wrote :

Finally, today it worked.

Our superspammer has posted two posts and both are hidden (I have tweaked the keywords a bit). I will leave the posts for now, but I set the user inactive... let's see how he reacts.

The two Django errors were triggered by me. I fixed this directly on the server and pushed the fix to trunk.

Once the REMOTE_ADDR branch is merged, I will look further into the spam issue.

Revision history for this message
SirVer (sirver) wrote :

Congratulations!!!

We still had a sneakier spammer last night, though: he wrote a meaningful text (not related to Widelands) with a single link. I think it will be hard to filter for those...

Revision history for this message
kaputtnik (franku) wrote :

I saw this spam also, deleted it and set the user to not active.

But no time for congrats... I currently don't know why, but our superspammer can log in even though he is set inactive. And he can also write spam that is visible, even though the keywords are used. Neither happens locally: if I copy and paste the spam posts I received by e-mail into a new topic on my local installation, they always get hidden. Why does it not work on wl.widelands.org?
If I set a user inactive and try to log in, it does not work (the login error "Username and Password didn't match" appears). This happens locally, and I also tried it on widelands.org: I created a new user, activated him, set him inactive, and then the new user wasn't able to log in. Why can the superspammer log in?

My plan for now is:
- get the most annoying posts hidden (next step is to add the check for international phone numbers)
- then change the notifications to only inform admins (or forum moderators) about the hidden posts
- improve the workflow for making non-spam visible or deleting spam, like the suggestion in #41
- then add the akismet check, which may also need some changes to the user interface

Revision history for this message
SirVer (sirver) wrote :

> even though he is set inactive

Working hypothesis: could he have reused the last session cookie? I wonder what happens in the following scenario: make a user, login in one tab and keep the tab open. Set the user to inactive in another tab. Now try posting something in the first tab. If that works, the session cookie is still valid and we do not check for activeness in the right places.

Your work plan sounds awesome to me!

Revision history for this message
kaputtnik (franku) wrote :

Setting a user to inactive takes effect only once he has logged out. So the current session stays valid as long as he is logged in. After logging out he can't log in anymore under normal circumstances. I tried to use a tool called Cookie manager to restore the last cookies (stored while the user was marked as active), but I couldn't log in. But I am not good at such things...

Looking at the 'Last Login' field of the superspammer, the last login is always: 13.10.2016. Is this a hint that he always uses an old session cookie?

Reading https://docs.djangoproject.com/en/1.8/topics/http/sessions/#browser-length-sessions-vs-persistent-sessions I believe the following happens (just tested):
- He never logs out; instead he closes the tab or the browser
- Nevertheless the username is shown for a while on the site for other users
- After some time the username disappears for other users -> it looks like he has logged out, but he has only closed the tab in the browser
- If he opens the URL again, he is still logged in with the same (former) session

From the database point of view this happens:
- A new user logs in: An entry in table django_session is created
- The user closes the tab (or the browser): The table entry is not deleted
- The user opens the widelands URL again: Django updates the previously created row in the table django_session

From my understanding the setting SESSION_COOKIE_AGE (default value: two weeks) influences this behavior. A session is valid for two weeks, as long as one doesn't explicitly log out.

What we can do to prevent this is to reset all session data. This means that no user has stored session data in the database anymore, and every user has to log in again the next time. See the Note in https://docs.djangoproject.com/en/1.8/topics/auth/customizing/#specifying-authentication-backends

I suggest the following changes:
- add setting SESSION_COOKIE_AGE and change default from 2 weeks to 1 or 2 days. From my point of view it could also be less than one day.
- if not already done: add management command 'clearsessions' to a cronjob. This will remove all entries in the session table of the database which are older than SESSION_COOKIE_AGE. See https://docs.djangoproject.com/en/1.8/topics/http/sessions/#clearing-the-session-store
- If a user is set to 'inactive' the entry in the session table should be immediately removed (I currently have no idea how to do this though)
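
One possible way to do the last point, as a sketch: Django keeps the id of the logged in user inside the encoded session data under the key '_auth_user_id', and the Session model offers get_decoded() to read it. The helper below only relies on those two facts and takes the sessions as an argument, so it can be tried without a database; the function name is hypothetical.

```python
def delete_sessions_for_user(sessions, user_id):
    """Delete every session that belongs to user_id; return how many."""
    deleted = 0
    for session in sessions:
        data = session.get_decoded()
        # Compare as strings: the id may be stored as text or as a number,
        # depending on the Django version.
        if str(data.get("_auth_user_id")) == str(user_id):
            session.delete()
            deleted += 1
    return deleted
```

In the website this would presumably be called with Session.objects.all() (from django.contrib.sessions.models) right after a user is set to inactive. It decodes every stored session, so it is not cheap on a large session table.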

Revision history for this message
SirVer (sirver) wrote :

> Looking at the 'Last Login' field of the superspammer, the last login is always: 13.10.2016. Is this a hint that he always uses an old session cookie?

That sounds weird. I would also not expect a session cookie to be valid across users - so deleting its user should make it impossible to reuse the cookie.

> SESSION_COOKIE_AGE

I would really hate decreasing this value. It is also the period after which a regular user has to log in again, and it is very convenient that our users do not have to type their password every day when going on the site, but only once every 2 weeks.

> if not already done: add management command 'clearsessions'

There is already a cron file: /etc/cron.daily/django_regular_commands. This command was not included in there though, so I added it now.

> If a user is set to 'inactive' the entry in the session table should be immediately removed (I currently have no idea how to do this though)

Maybe we could also check in some middleware that the user the session belongs to is not inactive. If it is, we show 505 instead of the page. See for example BannedIPMiddleware in tracking/middleware.py - that seems to be doing something similar.

Revision history for this message
GunChleoc (gunchleoc) wrote :

How about adding a check whenever a user loads a page to make sure that the user is still active? Something like

user_is_loggedin = user.has_session() && user.is_active()

Revision history for this message
kaputtnik (franku) wrote :

Ok, let's leave the SESSION_COOKIE_AGE as it is. The best way to log a user out is using the flush() method: https://docs.djangoproject.com/en/1.8/topics/http/sessions/#django.contrib.sessions.backends.base.SessionBase.flush

During testing this works just fine. Creating a middleware that tests each request for 'is_active' and runs 'request.session.flush()' in case of 'not is_active' should work imho. The user is then logged out and can't log in anymore. But i am not sure if this is really a goal: registering with another username is easy... resulting in lots of users which may have to be set 'not active'. I would say: let's leave it as it is right now. Having hidden posts is better than having new users.
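
A minimal sketch of the middleware idea described above, assuming the old-style middleware API of Django 1.8 (a class with process_request). The names logout_if_inactive and LogoutInactiveUserMiddleware are made up for illustration; only request.user, is_authenticated(), is_active and request.session.flush() are taken from Django. Whether to answer with an error page instead of just flushing the session is left open here.

```python
def logout_if_inactive(request):
    """Flush the session if the logged-in user has been set inactive.

    Returns True when the session was flushed, False otherwise.
    """
    user = getattr(request, "user", None)
    if user is None:
        return False
    # In Django 1.8 is_authenticated() is a method, is_active is a plain field.
    if user.is_authenticated() and not user.is_active:
        request.session.flush()
        return True
    return False


class LogoutInactiveUserMiddleware(object):
    """Hypothetical old-style middleware; must run after the auth middleware."""

    def process_request(self, request):
        logout_if_inactive(request)
        # Returning None lets Django continue processing the request.
        return None
```

Because the check runs on every request, a deactivated user would lose the session on the next page load instead of staying logged in until the cookie expires.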

The current state of the work is:
- Moved spam keyword check to pybb/forms.py. This is a better place for doing this
- Inform only admins of hidden posts by e-mail. Posts that are not hidden are handled as usual. It needs additional work if a hidden post is not spam and gets unhidden (is 'uncovered' the right term here?) by an admin: the notifications need to be updated and users must be notified of a new post in subscribed topics
- Changed the admin site of pybb/post to show the 'hidden' property, and modified some of the fields which are shown. The admin page can be sorted by the 'hidden' property

I want to do some more tests before proposing to merge though.

Superspammer was active yesterday, but the one post he has written got hidden. The day before yesterday and today no spam. Added the term 'rsgold' to the keywords.

Revision history for this message
SirVer (sirver) wrote :

Your argument about the session cookie is convincing to me. Let's focus on the actual SPAM.

> (is 'uncovered' the right term here?)

no, I would call it unhide/unhidden to be symmetric with the flag name.

your approach to keeping SPAM in check seems to be working. thanks for doing this!

One nit: the "Latest posts" section still only shows a few (3 currently) on my system. Do you know why that is?

Revision history for this message
kaputtnik (franku) wrote :

> One nit: the "Latest posts" section still only shows a few (3 currently) on my system. Do you know why that is?

I encountered this behavior also before the last changes and i can't say why. Locally the last 5 posts are always shown, so i couldn't investigate it locally.

Revision history for this message
kaputtnik (franku) wrote :

Today a new superspammer was active: 41 posts/topics were created in 3 hours. One post/topic from the old known superspammer. All posts got hidden.
GunChleoc has found another bug: All topics were created in the forum Homepage. The result is that the first two pages in this forum show no topics anymore.

I have set the new spammer to inactive.

Revision history for this message
SirVer (sirver) wrote :

Seems like your approach is working great! Congrats.

> I have set the new spammer to inactive.

Can we not delete them as a rule of thumb? I really hate the thought of our database being slowly swamped in spam.

Revision history for this message
kaputtnik (franku) wrote :

Comparison of 'deleting a user' against 'setting a user inactive':

Deleting a user:
- He has to register from scratch
- He may use the same user name and email address when registering again

Setting a user inactive:
- As long as he doesn't explicitly log out, the setting does not take effect. And once he has logged out or gets logged out by us:
- He has to register from scratch and MUST use a different username, he could use the old email address

I can't see any advantage in setting a spammer inactive. In fact it could be bad, because he MUST use another username, whereas after deleting him there is a chance that he registers with the same username again (giving some more hints to hide posts 'by username'). So i agree that deleting a user is the better approach.

Revision history for this message
kaputtnik (franku) wrote :

There was a suggestion about a Django command for sending emails daily (or hourly) in case there is spam. I misunderstood this suggestion at that point, which led to a confusing post by me. But now i am working on such a command, as well as on fixing the pagination and the display of the topics list (see #54).

Just a question for usability if (potential) spam is found:

1. Do we need to show the spam posts to superusers in the forum views?
2. Or is it enough to have the admin page for that?

These questions affect the email sent by the Django command:
What content should the email sent by the Django command have? For now i have it like:

Found 7 hidden posts:
Text snippet: LOVE VASHIKARAN SPECIALIST BABA ji in +91-9799298747 Only one call and
Link: https://localhost:8000/forum/post/9179/
Text snippet: this is vashikaran
Link: https://localhost:8000/forum/post/9180/
[...]

The link points directly to the post. If we want this, i will add an additional button (beside "Delete") to give the opportunity to unhide the post. If you think this is not needed, i will add only a link to the admin page of pybb/post. The first suggestion would be comfortable if we use another group (like Forum Admin) without rights to access the admin site. That needs some further changes though...
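
As a rough illustration of the mail format shown above, the body could be assembled like this. This is only a sketch: build_hidden_posts_mail, the (post_id, text) tuples and the snippet length of 70 characters are assumptions, not the code of the actual command.

```python
def build_hidden_posts_mail(posts, base_url, snippet_length=70):
    """Build the plain-text mail body for a list of hidden posts.

    posts is a list of (post_id, text) tuples; base_url is the site root.
    """
    lines = ["Found %d hidden posts:" % len(posts)]
    for post_id, text in posts:
        # Collapse whitespace so that each snippet stays on one line.
        snippet = " ".join(text.split())[:snippet_length]
        lines.append("Text snippet: %s" % snippet)
        lines.append("Link: %s/forum/post/%d/" % (base_url.rstrip("/"), post_id))
    return "\n".join(lines)
```

Linking to the admin page instead of the forum post would only change the URL pattern in the "Link:" line.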

Revision history for this message
GunChleoc (gunchleoc) wrote :

Since we have enough admins ATM to handle it, I think linking to the posts in the admin panel is sufficient.

Revision history for this message
SirVer (sirver) wrote :

+1 for option 2.

Revision history for this message
GunChleoc (gunchleoc) wrote :

One more thing for the todo-list: When spam posts are hidden, the forums still get marked as having new posts in them, adding the red star to the icon.

Revision history for this message
kaputtnik (franku) wrote :

I found this red mark only in https://wl.widelands.org/forum/ ... this is fixed now in the anti_spam_4 branch :-)

All other views should be fine. But i remember there were some difficulties with those marks in the past.

kaputtnik (franku) on 2017-01-19
Changed in widelands-website:
importance: High → Medium
Revision history for this message
kaputtnik (franku) wrote :

Since the used approach is working for now, i consider this bug as fixed.

In the last month there was one false positive (a non-spam post was hidden), but several spam posts got hidden.

Leaving this bug as 'in progress' but marking it as low importance.

Changed in widelands-website:
importance: Medium → Low
Revision history for this message
kaputtnik (franku) wrote :

The latest spam posts were all written in an Asian language. I think a serious user wouldn't write that way, because he can see that all posts are in English/German/Spanish/French.

Should we try to hide all posts written in a non latin language?

Revision history for this message
SirVer (sirver) wrote :

I’d not like that. I feel it is our fault that we are not more inclusive towards non-English speakers - and we have no offer for Asian natives at all. That is bad enough, but understandable because our community does not have somebody who could lead and moderate an Asian sub community. Maybe that person shows up eventually though.

Actively excluding everybody writing in a non Latin based language sends the message that we do not want such people, which is incorrect and quite harsh.

> On 13.10.2017 at 08:31, kaputtnik <email address hidden> wrote:
>
> The last spams were are all written in an asian language. I think a
> serious user didn't write that way, because he can see that all posts
> are english/german/spanish/francaise.
>
> Should we try to hide all posts written in a non latin language?

Revision history for this message
kaputtnik (franku) wrote :

> That is bad enough, but understandable because our
> community does not have somebody who could lead and
> moderate an Asian sub community. Maybe that person
> shows up eventually though.

Such a person has to be in contact with the english community. So if one asks for an asian sub community he will do that in english. A 'non-latin filter' could then also be deactivated.

> Actively excluding everybody writing in a non Latin based language sends the message that we do not want such people, which is incorrect and quite harsh.

I did not talk about excluding, just about hiding the post as we do now and showing a message to the user that posts are moderated and have to be reviewed. I don't think a user would feel that this is incorrect and harsh.

Don't know if you knew about the last Asian spam attack last Tuesday with 170 new topics in one hour? Only after deleting the user(s) did the spamming stop. The hidden Asian posts found now were recognized because i modified our keyword spam filter, and the result was only 5 new topics from 5 new users. I am sure if i hadn't added the keyword, there would have been much more spam in Asian languages today.

By the way, the usernames used for the last two spam attacks follow the same scheme. For each attack session the usernames consist of 4 random characters followed by a number from 1 to 5. E.g. in the first attack on Monday the usernames 'gkfk[1-5]' were used. Maybe we could also find a good rule for valid usernames?
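
A small sketch of how such a naming scheme could be detected. It assumes the "4 random characters" are lowercase ASCII letters, which the 'gkfk' example suggests but does not prove; the function name and the minimum group size are made up. Instead of rejecting single names (a regular like 'anna1' would match the pattern too), it only reports prefixes that show up with several numbers.

```python
import re
from collections import defaultdict

# Assumption: four lowercase letters followed by a single digit from 1 to 5.
SUSPICIOUS_NAME = re.compile(r"^([a-z]{4})([1-5])$")


def suspicious_name_groups(usernames, min_group_size=2):
    """Group usernames like 'gkfk1', 'gkfk2' by their common prefix."""
    groups = defaultdict(list)
    for name in usernames:
        match = SUSPICIOUS_NAME.match(name)
        if match:
            groups[match.group(1)].append(name)
    # A single matching name is not suspicious on its own.
    return {
        prefix: names
        for prefix, names in groups.items()
        if len(names) >= min_group_size
    }
```

Used on the list of recently registered users, this would give admins a hint which accounts to look at, without blocking anybody at registration.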

Revision history for this message
GunChleoc (gunchleoc) wrote :

I agree with SirVer in principle, so unless we get completely overwhelmed by this type of spam, we'd better keep it inclusive. Good rules for usernames are hard too, because people do get creative with them.

Of course, it is a lot harder to identify spam keywords if we can't read the script, but I'm still in favour of a keyword approach rather than a blanket non-Latin script approach. The Korean spam I got in my notification inbox is for gambling sites, so adding to the keyword list should be enough:

바카라 - baccarat
온라인 - online
룰렛 - roulette
카지노 - casino

Give my pattern recognizing brain some machine translation, and I will become dangerous *evil grin*

Revision history for this message
SirVer (sirver) wrote :

> Don't know if you knew about the last asian spam attack last Tuesday with 170 new topics in one hour? Only after deleting the user(s) the spamming stops. The hidden asian posts found now were recognized because i modified our keyword spamfilter and the result was only 5 new topics from 5 new users.

That is quite an achievement and shows that your approach works really well for our page! Thanks for your work on this.

+1 for #66.

Revision history for this message
kaputtnik (franku) wrote :

If i put the Asian characters into the spam keywords the website responds with a '502 Bad Gateway'.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Sounds like a problem with Unicode support *sigh*

Revision history for this message
SirVer (sirver) wrote :

> If i put the asian signs into the spam keywords the website responses with a '502 Bad Gateway'.

Could you try this on the alpha site? Do we have a proper encoding header in the blacklist file? [1]

> Sounds like a problem with Unicode support *sigh*

Yes, but Python fully supports unicode and so does django. I think we probably just need to define the encoding in the file where we define the blacklist.

[1] https://stackoverflow.com/a/6289494
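
For illustration, this is roughly what a file with an encoding declaration (PEP 263) and non-ASCII keywords could look like. The declaration has to be on the first or second line of the real file. SPAM_KEYWORDS and contains_spam_keyword are hypothetical names, not the ones used on the website; the u'' prefix makes the literals text strings under Python 2 and is harmless under Python 3.

```python
# -*- coding: utf-8 -*-
# The line above tells Python 2 that this source file is UTF-8 encoded.
# Without it, Python 2 refuses to load a file containing non-ASCII bytes.

# Hypothetical keyword list mixing Latin and Korean entries.
SPAM_KEYWORDS = [u"rsgold", u"바카라", u"카지노"]


def contains_spam_keyword(text):
    """Return True if any keyword occurs in the given text."""
    return any(keyword in text for keyword in SPAM_KEYWORDS)
```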

Revision history for this message
kaputtnik (franku) wrote :

Either i didn't get notified about SirVer's answer, or i overlooked it.

Anyway, i proposed a solution for the Asian character problem. Testing locally works. The only thing i stumbled over is giving the file a fixed encoding: when the file contains Asian characters, the editor nano saves it as UTF-8, whereas after removing the Asian characters nano stores it as ASCII again.

Revision history for this message
kaputtnik (franku) wrote :

While deleting the last spammer (all posts hidden, yeah :-)) i found other suspicious users registered today, but did not delete them. I love the new admin page of auth.user :-)

Since i tested the changes proposed in https://code.launchpad.net/~widelands-dev/widelands-website/settings_unicode/+merge/332381 on alpha, i have made the needed changes in local_settings.py on the production website, with a backup in local_settings.py.org.

I'll leave the merge proposal open, because we may want to add such lines in other files too.

Revision history for this message
kaputtnik (franku) wrote :

Looks like we need another set of korean words... the spam attack today used none of the words from #66

Revision history for this message
GunChleoc (gunchleoc) wrote :

When this happens again, please collect some sample texts for me? There are no hidden posts for me to look at.

Revision history for this message
kaputtnik (franku) wrote :

Seems there were real spam-bots active. Yesterday i deleted a user who made over 220 Korean spam posts, today a user had over 700 spam topics. All got hidden.

We should reconsider limiting the number of topics a user can create per day/hour, to save some traffic and database transactions, IMHO.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Let's go with a 30 second flood limit, which is pretty standard.
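
A flood limit of this kind boils down to comparing the time of the user's previous post with the current time. The following is only a sketch with made-up names (FLOOD_LIMIT, is_flooding); where the time of the last post comes from is not shown.

```python
from datetime import datetime, timedelta

# The 30 seconds suggested above.
FLOOD_LIMIT = timedelta(seconds=30)


def is_flooding(last_post_time, now, limit=FLOOD_LIMIT):
    """Return True if the previous post is more recent than the limit allows."""
    if last_post_time is None:
        # First post of this user, nothing to compare against.
        return False
    return now - last_post_time < limit
```

With these values a second post after 10 seconds would be rejected, while a post after one minute would pass.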

Revision history for this message
kaputtnik (franku) wrote :

I saw that new topics were made after a minute or so. Or do i misunderstand 'flood limit'?

For me there is another problem: these spambots never log out. So setting them to inactive will not stop them, because the current session stays valid. The only thing one can do is to delete the user, and thus his posts get lost for further examination. We discussed the 'session' issue earlier, and i think a third person can't easily log a user out or remove his session. See https://stackoverflow.com/questions/953879/how-to-force-user-logout-in-django#954318

What about counting hidden posts per user? Say if user x writes 10 posts which get hidden, we can be sure it's not a human. If he writes the 11th post he gets logged out. If he logs in again and writes the 12th post, he gets logged out, his 'is_active' flag is set to false (so he can't log in again) and he is shown an HTTP 505 or HTTP 403. Doing something like that we can at least examine the hidden posts.
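
The escalation described above can be written down as a small decision function. This is a sketch only: the function name, the returned action strings and the constant names are invented, and the actual logout and deactivation would have to be done by the caller.

```python
# Thresholds taken from the proposal above.
MAX_TOLERATED_HIDDEN = 10   # up to this count posts are only hidden
LOGOUT_AT = 11              # the 11th hidden post logs the user out


def action_for_hidden_post(hidden_count):
    """Decide what to do after a user's hidden_count-th post got hidden.

    Returns 'hide', 'logout' or 'deactivate'.
    """
    if hidden_count <= MAX_TOLERATED_HIDDEN:
        return "hide"
    if hidden_count == LOGOUT_AT:
        return "logout"
    # 12th hidden post and beyond: log out and set is_active to False.
    return "deactivate"
```

Since the user is only deactivated and not deleted, the hidden posts stay in the database for examination.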

Revision history for this message
GunChleoc (gunchleoc) wrote :

1 minute is too long to be caught by a flood limit then. They probably already know that trick.

I think your solution could work, go for it!

Revision history for this message
kaputtnik (franku) wrote :

Just caught one...

Every 2 to 3 minutes a new topic gets created.

Revision history for this message
kaputtnik (franku) wrote :

Looks like the spammers are getting smarter... I wondered why my "new topic mails" from today didn't show any Korean characters, whereas opening a post in the forum shows Korean characters. Testing such a text against our keywords does match. That's strange... But i think i found what's going on: the spammer creates a topic with unsuspicious content like 'asdasda', saves it and afterwards edits the post to put his spam text in there. Because we currently check only newly created posts/topics for spam, the edited posts slip through.

To change this behavior i want to pull out the spam checks for new topics and make them callable from different functions. So we can easily add a call from edit_post_ctx as well (or from views/forms of other apps).
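
The refactoring could look roughly like the following sketch: one shared check that both the "new post" and the "edit post" code paths call. Post, create_post and edit_post are simplified stand-ins for the pybb code, not its real API, and the keyword list is just an example built from terms mentioned in this report.

```python
# Example keywords taken from this bug report.
SPAM_KEYWORDS = ["vashikaran", "rsgold", "카지노"]


def is_spam(text, keywords=SPAM_KEYWORDS):
    """Shared check, callable from any view or form."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


class Post(object):
    """Stand-in for the forum post model."""

    def __init__(self, body):
        self.body = body
        self.hidden = False


def create_post(body):
    post = Post(body)
    post.hidden = is_spam(body)
    return post


def edit_post(post, new_body):
    post.body = new_body
    # Run the same check on edits; never unhide automatically.
    if is_spam(new_body):
        post.hidden = True
    return post
```

With this, the trick of creating a harmless topic like 'asdasda' and editing the spam in afterwards would end with a hidden post as well.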

I am AFK this weekend, next week i want to work on this.

Revision history for this message
GunChleoc (gunchleoc) wrote :

Sounds like a plan. What a waste of your time *rolleyes*

Revision history for this message
GunChleoc (gunchleoc) wrote :
Changed in widelands-website:
status: In Progress → Won't Fix