Web Hosting Forum | Lunarpages


Author Topic: How-to: Train SpamAssassin - Updated April 27, 2010  (Read 135103 times)
w98
Galactic Royalty
*****
Offline Offline

Posts: 441



WWW
« on: April 08, 2004, 01:15:29 PM »

UPDATED: January 2014

v4.02 is released.

New in v4: much nicer HTML layout (bootstrap, nav menu with links to documentation, email and my new Google Helpout live support option)

I've built a "Build Your Own SpamAssassin Trainer" web app that asks a few simple questions and generates everything you'll need for the Perl script. I'll be making other changes to the Perl script this spring that should improve performance.

Documentation: please visit http://iandouglas.com/spamassassin-trainer/

Build your trainer script here: http://iandouglas.com/sa-trainer/

Using the /sa-trainer/ link will let you configure your script in a web page using some simple prompts, and build a .zip file for you.


DISCLAIMERS:
Disclaimer #1: I wrote this script back in 2001, have been hacking at it ever since, and first posted it here in April 2004. It works amazingly well for me. Your mileage, of course, may vary.
Disclaimer #2: LunarPages has given me permission to post this information and quick-start guide with the following notes:
Quote
please include a warning that it is the user's own responsibility to mess with it :)
and
Quote
(paraphrased) Please announce that all LunarPages users should consider this message thread as the primary source of support for sa-trainer.cgi
I fully intend to hang out here (since I'm also a LunarPages user myself) to support this script till the end of time, so I'm happy to comply.
Disclaimer #3: While I have tried very hard to document this as carefully as I can and use 'best practice' software development efforts, some errors are bound to happen, so there are NO guarantees on these instructions whatsoever. However, numerous LunarPages users use my script on a regular basis and have seen dramatic drops in the amount of spam in their Inbox.
Disclaimer #4: The new script (starting at v3.02) and full supporting documentation are located at iandouglas.com (linked later within the quick-start guide for the download link, and linked again at the bottom of this message). LunarPages support staff have been awesome about letting me move the script out of this message thread (since the script is now too big to fit in a single message here). Just be aware that viewing the full documentation and downloading the script itself will take you away from LunarPages and LunarForums.com. Please do return here for support if you are a LunarPages user -- I promised LP that I'd always be available to this forum for assistance.

THE NEW sa-trainer.cgi QUICK-START GUIDE
Here are some very general instructions for setting up SpamAssassin in CPanel, configuring the final details, downloading and installing the script, and getting it running. These instructions will teach you to do the following:
  • create an Email account called globalham@mydomain.com for your users to forward their non-spam messages to
  • enable the CPanel "spam box" option, and scan each individual user's spam mailbox

Assumptions I Need to Make about You
If you want to take the simplest approach, and use the default behavior of this script:
  • that you know how to log in to CPanel using your LunarPages or other hosting account details
  • that you know how to create a new mailbox for your primary domain in CPanel
  • that you can save a copy of my script on your local computer, change it in a text editor like Notepad or TextEdit (not a word processor like MS Word), and save the file
  • that you know how to use an FTP program to upload a copy of the script to your hosting account
  • if you have multiple users with mailboxes through your account, that you can communicate effectively with your users to clean up their own mailboxes once you've finished running this training script
If you want to use more advanced features of this script:
  • that you can do all of the above, and know how to search through the configuration settings within the script to make changes to suit your needs
  • if you download your Email through third-party software (Outlook, Outlook Express, Thunderbird, Eudora, etc.), that you are familiar enough with that software to add an IMAP account or profile
  • or, if you always use webmail such as Squirrelmail or Horde, that you are familiar enough with using the software to move or copy messages to other folders

Terminology You Need to Learn
SPAM: unsolicited Emails that want you to buy something or that contain adult-themed references, and that you'd rather not get anymore.
HAM: non-spam, legitimate Emails from friends, family, newsletters, and so on.
SA: short for SpamAssassin.
False-Positive: a non-spam (HAM) message that SA flagged as SPAM and that ended up in our spam box.
False-Negative: a SPAM message that SA treated as non-spam (HAM) and that ended up in our Inbox.
IMAP: an Email protocol used to read and manage the messages stored in your hosting account's mailboxes. Generally, IMAP will leave a copy of downloaded messages on the server instead of downloading them to your computer and deleting the server's copy.
"spam/ham folder pair": this is a set of mail folders (which may actually be files instead of folders) that we will set up and use to store copies of messages to train SpamAssassin with.
primary domain: the first (or only) domain name configured for your CPanel account, not an add-on or parked domain added later.

NOTE: for all examples in the setup and the script itself, the account name I will use is myaccount. The primary domain for my CPanel account is mydomain.com. I'll do my best to keep these terms bolded throughout this text to highlight where you'll need to insert your own information.

Configure CPanel to turn SpamAssassin on

Log in to your CPanel interface, click on the 'Mail' icon, click on the link for 'SpamAssassin'.
Click on the 'Enable SpamAssassin' button, click on the 'go back' link.
Click on the 'Enable Spam Box' button, click on the 'go back' link.
Click on the 'go back' link again so you're back at the 'Mail' icon menu list where you clicked on 'SpamAssassin'
Click on 'Add/Remove/Manage Accounts'
Click on 'Add Account' link at the bottom
Set the Email account as 'globalham' at your primary domain name, set a password, and set a reasonable quota based on your usage, such as 100MB or 200MB. Click the 'Create' button
Click the 'Go Back' link

Create/Edit /home/myaccount/.spamassassin/user_prefs

In CPanel, click on the File Manager icon
Click on the folder icon next to the ".spamassassin" folder link
If "user_prefs" doesn't already exist, click on the "Create New File" link, call the file "user_prefs" and specify that it is a Text Document, and click the Create button.
Click on the filename link for "user_prefs", and in the top-right corner of the screen, select to edit the file.
Replace the entire contents of the file with this text:
Quote
use_bayes   1
required_hits   3.5
rewrite_subject   1
subject_tag   {SPAM _SCORE(0)_}
bayes_path   /home/myaccount/.spamassassin/bayes
bayes_file_mode   0600
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
... be sure to replace "myaccount" with your actual CPanel username, and click the 'Save' button
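
If you have shell access, the same file can be generated instead of pasted by hand, which avoids forgetting to swap in your account name. This is only a minimal sketch and is not part of sa-trainer.cgi: the function name write_user_prefs is hypothetical, and it assumes your home directory is /home/<account> as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the user_prefs shown above with the
# CPanel account name filled in. Nothing is written to disk unless
# you redirect the output yourself.
write_user_prefs() {
    account="$1"
    cat <<EOF
use_bayes   1
required_hits   3.5
rewrite_subject   1
subject_tag   {SPAM _SCORE(0)_}
bayes_path   /home/${account}/.spamassassin/bayes
bayes_file_mode   0600
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
EOF
}

# Example (assumes the account is named "myaccount"):
#   write_user_prefs myaccount > /home/myaccount/.spamassassin/user_prefs
```

Note that required_hits, rewrite_subject and subject_tag are the SpamAssassin 2.x option names. If your host runs SpamAssassin 3.x, the equivalents are required_score and rewrite_header Subject, so check which version you have before relying on the subject tagging.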

Getting the sa-trainer.cgi Script
Build everything you need at http://iandouglas.com/sa-trainer/
This will take you away from LunarForums.com, but is the preferred method for getting a stable copy of the script. On that page, follow the instructions and download the .zip file it creates for you.

Upload sa-trainer.cgi to your hosting account
Using your favorite FTP program, upload the script in ASCII mode into your /www/cgi-bin/ folder, and set the permission bits (chmod) to 755. The script likely will not run without this.
*** I recommend renaming the script to something other than sa-trainer.cgi (like my-spam-trainer.cgi, or anything else with a .cgi file extension). That way outsiders are less likely to know you run this script if a bug is ever found that could be exploited (though I haven't found any myself, nor have any been reported to me, in the past three years).
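
For those with shell access rather than FTP, the rename and the 755 permission can be done in one step. This is a rough sketch under a few assumptions: install_trainer is a hypothetical helper name, and the cgi-bin path in the example is a guess at a typical CPanel layout, so adjust it to your own account.

```shell
#!/bin/sh
# Hypothetical helper: copy the script into cgi-bin under a less
# guessable name and mark it executable (755, i.e. rwxr-xr-x).
install_trainer() {
    src="$1"        # path to your downloaded sa-trainer.cgi
    cgi_dir="$2"    # e.g. /home/myaccount/www/cgi-bin (assumed layout)
    new_name="$3"   # e.g. my-spam-trainer.cgi
    cp "$src" "$cgi_dir/$new_name" &&
    chmod 755 "$cgi_dir/$new_name"
}

# Example:
#   install_trainer ./sa-trainer.cgi /home/myaccount/www/cgi-bin my-spam-trainer.cgi
```

The chmod only runs if the copy succeeded, so a typo in the destination folder will not silently leave you with a missing script.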

If you do not have an FTP program, you can open the script in Notepad or TextEdit again, copy the entire contents to your clipboard, and do the following:
in CPanel, click on the File Manager icon
click on the yellow folder beside "public_html" or "www" (they both go to the same place)
click on the yellow folder beside "cgi-bin"
click on the link to "create new file"
In the top right corner of the screen, specify to create a new text document called "sa-trainer.cgi" (or some other filename to avoid any security issues) and click the Create button
In the new window that pops up, paste the contents of the script into the space provided, and click the 'save' button at the bottom, then close the pop-up window
Back in the File Manager window, click on the filename (which is a link) for sa-trainer.cgi
Click on the link to set the permissions of the script, and select the 'execute' bit for all 3 columns so the permissions number reads '755' and click the 'change' button.

Have some recent spam/ham available to train with
Once you have some spam and ham messages available in the mailboxes you configured, simply call your script in your web browser, like "http://www.mydomain.com/cgi-bin/sa-trainer.cgi" (or whatever you called your copy of the script).

Ongoing maintenance
1. Teach your users to forward non-spam messages to globalham@mydomain.com, with a disclaimer that no human eyes will ever see the mailbox (you could be found liable for reading their private messages, so be sure you're not secretly peeking in there...). Instruct them not to forward messages over 100 KB or with file attachments, as these can confuse SpamAssassin and slow down the scanning.

2. Once scanning is complete, empty the Inbox for the globalham@mydomain.com account. The easiest and quickest way to avoid any legal/privacy concerns is to completely delete the mailbox from CPanel and rebuild it.

3. You will also need to instruct your users to empty their spam boxes once scanning is complete. To do this, they can highlight/select all of their spam messages in the 'spam' folder, and use the delete function of their webmail/Email client software.

Did I forget anything?
Be sure to notify me if I've neglected to describe any step along the way.

Full Documentation
A MUCH larger version of this documentation is available at http://iandouglas.com/spamassassin-trainer/ You will probably need an hour or more to read through it (told you it was huge), but it goes much deeper into the configurable options of the script.

And as stated in a few other places here: if you are a LunarPages customer, this forum message thread that you're reading right now is your primary means of support for this script so please post messages here if you have questions or problems with the script.

Like it? Love it? Need extra help?
Check out my live support option on Google Helpouts now: https://helpouts.google.com/113763167140406107715/ls/bbcf42fd8de12842

Good luck, and happy spam fighting!
« Last Edit: February 06, 2014, 08:39:20 PM by w98 »

Danielle
Guest
« Reply #1 on: April 08, 2004, 01:18:04 PM »

Hi Ian,

I changed the thread from a normal post to a sticky since I think this is great. I may also place it in the how-to section, since again this is great.

Have a Blessed Day

w98
« Reply #2 on: April 08, 2004, 01:20:32 PM »

w00t, many thanks!

-id

Lopht
Intergalactic Cowboy
*****
Offline Offline

Posts: 50


WWW
« Reply #3 on: April 09, 2004, 06:37:48 AM »

this rocks, thanks Ian.

Lopht
« Reply #4 on: April 09, 2004, 06:55:21 AM »

One question: the paths put into sa-learn.cgi reflect the account I use to log into CPanel with, correct? Do I have to have entries in this file for each mail account in the domain?

print `$salearn -p /home/lopht01/.spamassassin/user_prefs --mbox --spam --showdots /home/lopht01/mail/domain/user/myspam`;

print `$salearn -p /home/lopht01/.spamassassin/user_prefs --mbox --ham --showdots /home/lopht01/mail/domain/user/myham`;

w98
« Reply #5 on: April 09, 2004, 09:19:36 AM »

Heya Lopht,

Yes, you have that set up right. Training SA on spam/ham from various other mailboxes that way would work just fine.

As long as the user_prefs file points to your domain's bayes database files, it'll train everything into those databases. The down side with this, of course, is if one user thinks a message is SPAM while another thinks it's HAM, and they both try to train SA - SA will only remember the last way you trained it when it looked at a particular message.

So if user A and user B get a copy of message X from a newsletter, and user A trains it as SPAM and user B trains it as HAM, user A will continue to get the newsletter because user B re-trained SA using your overall domain bayesian database.
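
To make the shared-database point concrete, here is a dry-run sketch that prints one pair of sa-learn commands per mailbox user, all pointing at the same user_prefs and therefore the same Bayes database. It only prints the commands and does not run them. The function name print_training_commands is hypothetical, the directory layout is taken from Lopht's example, and it assumes sa-learn is on your PATH (the script itself uses a $salearn variable instead).

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) the sa-learn commands for each
# mailbox user. Every user shares the one Bayes database that
# user_prefs points to, so the last training of a message wins.
print_training_commands() {
    account="$1"   # CPanel account name, e.g. myaccount
    domain="$2"    # mail domain directory, e.g. mydomain.com
    shift 2        # remaining arguments are mailbox user names
    prefs="/home/$account/.spamassassin/user_prefs"
    for user in "$@"; do
        base="/home/$account/mail/$domain/$user"
        echo "sa-learn -p $prefs --mbox --spam --showdots $base/myspam"
        echo "sa-learn -p $prefs --mbox --ham --showdots $base/myham"
    done
}

# Example:
#   print_training_commands myaccount mydomain.com alice bob
```

Once the printed paths look right for your account, the output can be piped to sh to actually run the training.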

I don't know if we could set up SA for each individual Email account. That would take a lot of extra configuration, especially from LP's point of view, and would use up a lot more of your disk quota to manage it on a per-user basis.

Just an extra $0.02. Thanks for the feedback.

-id

Lopht
« Reply #6 on: April 09, 2004, 09:38:32 AM »

Fortunately I am the only 'user' on my domain. ;)

One thing I didn't see in the howto was to give the cgi script execute permission.

One oddity: when I first ran it with 4 messages in the myspam folder and none in the myham folder, it said it learned from 5 messages in myspam and 1 in myham. I'll assume that's just the "this is administrative data for this folder, do not delete" message that some readers don't display and some do?

Is there a programmatic way to also empty those folders after processing them?

w98
« Reply #7 on: April 09, 2004, 09:49:36 AM »

Yeah, you *could* set the Perl script to remove the messages in there with a library like Mail::Box. That's how I did it on my system after processing the messages. The only 'gotcha' is that it's been known to totally erase the mailbox once it's deleted all of the messages, so you'd possibly have to find another Perl library to recreate the IMAP folder if it gets deleted. Although, just doing a 'touch' system call should create it ... hmm (pondering)

And yes, the extra message it counted is the message about "administrative data, do not delete".

I had edited the main message yesterday to include the 755 permission notes, but it's not there this morning, so I've added it again.

-id

Lopht
« Reply #8 on: April 09, 2004, 09:56:24 AM »

heh, one last thing, the example URL you give to run it says sa-train.cgi, but everywhere else it says sa-learn.cgi.

A quick and dirty way to "empty" the mailbox would be:

system ("echo \"\" > /home/lpaccount/mail/myham");
system ("echo \"\" > /home/lpaccount/mail/myspam");

Or if the admin data is really needed, just copy/paste it into the Perl script as a multi-line string and echo that. No real need to bring in a whole module. ;)

w98
« Reply #9 on: April 09, 2004, 09:58:28 AM »

Oops, my bad. I started out calling it "train.cgi" then changed to "sa-learn.cgi", I'll update that now.

I tried the system call to echo a blank line into the mailboxes, but then it adds an empty message into each mailbox... SpamAssassin ignores it though, so I went ahead and added the two lines at the end of the script above... and gave credit where it's due.

-id

w98
« Reply #10 on: April 09, 2004, 10:14:17 AM »

Hmm... having the system call redirect a blank line to the mail folders seems to work despite a 600 permission on the mailbox file itself.

I got Perl to open the file, write nothing into it, and then close it. Unlike system("echo"), this doesn't write a blank message into the mailbox, so it effectively clears it right out.
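
For comparison, the same difference can be seen from a shell prompt. This is a shell sketch of the idea, not the Perl code used in the script, and empty_mailbox is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: empty a mailbox file in place without writing
# anything into it. ": > file" truncates the file to zero bytes,
# whereas 'echo "" > file' leaves a single newline behind.
empty_mailbox() {
    : > "$1"
}

# Example (path layout assumed from the posts above):
#   empty_mailbox /home/myaccount/mail/myspam
```

Because the file is truncated rather than deleted and recreated, its existing permissions (such as 600) stay as they were.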

-id

w98
« Reply #11 on: April 09, 2004, 10:24:00 AM »

Okay, made a few edits to the script as a whole to use a $basepath variable and a $configfile variable, to make the script quicker for users to edit.

-id

w98
« Reply #12 on: April 10, 2004, 09:10:04 AM »

One more thing to note if you feel that SpamAssassin is no better after following this how-to:

Bayesian filtering won't "kick in" until SpamAssassin sees that you've trained at least 200 spam and 200 ham messages.
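
If you have shell access, one way to see how close you are is to look at the counters that sa-learn --dump magic prints. The sketch below is only an illustration: bayes_ready is a hypothetical helper, and it assumes the dump lines end in "non-token data: nspam" and "non-token data: nham" with the count in the third column, so compare against your own output before trusting it.

```shell
#!/bin/sh
# Hypothetical helper: read "sa-learn --dump magic" output on stdin,
# print the two counters, and succeed only when both have reached
# the 200-message minimum mentioned above.
bayes_ready() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= 200 && nham >= 200)
        }'
}

# Example (assumes sa-learn is on your PATH):
#   sa-learn --dump magic | bayes_ready && echo "Bayes should be active"
```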

I'm still trying to determine whether LunarPages even uses our individual bayesian databases when forwarding Emails for our accounts. They DO use our personal user_prefs file (although custom rules seem to be ignored), but there are still a few unknowns as yet.

-id

Lopht
« Reply #13 on: April 11, 2004, 08:19:43 PM »

I've noticed that since adding this, SA hasn't flagged a single message as spam, is that what you meant with the last note?  My bayes_toks file is almost 5 MB, so it is updating based on the spam/ham I'm giving it.

w98
« Reply #14 on: April 11, 2004, 08:31:29 PM »

Well, I'm waiting to hear back from Max now whether their server configurations tell SA to look at our personal bayes databases when delivering mail. But yes, that's what I meant. You have to train at least 200 ham and 200 spam for SA to kick in. Kind of a lot, I know, but WELL worth the effort once trained.

-id