Web Hosting Forum | Lunarpages
Author Topic: How-to: Train SpamAssassin - Updated May 30 2007  (Read 46407 times)
w98
« on: April 08, 2004, 02:15:29 PM »

UPDATED: May 30, 2007

It seems that newer versions of SpamAssassin require new lines in the user_prefs file, which I document below.

sa-trainer.cgi v3.03 is ready for general use

This script will help you to train SpamAssassin to be more effective at keeping spam messages out of your mailboxes. When you have it installed, you can run it whenever you like in your web browser as a cgi-bin script. Once it has scanned a total of 200 spam messages and 200 non-spam messages, you should see a dramatic drop in the amount of spam you get.

The first message of this forum thread has been updated to reflect the new 'quick start' instruction guide. If you're new to using this script, please be aware that the first 10 pages of replies in this thread were based on older versions of this script and do not apply to these new instructions or to version 3.x of this script. Please jump to at least page 11 for new comments and feedback.


What's New in v3.03?
The new version 3 series of the script autodetects your Email storage type (Maildir vs Mbox), lets you configure any one of many different ham/spam scanning scenarios, and handles add-on domains. Version 3 of the script also checks against the latest downloadable version at iandouglas.com and alerts you when you run the script if a new version is available. No private information is sent, and you can turn off this feature with a configuration option.
Version 3.03 of the script was expected to let you build either an exclusive list of only the users who you want to scan, or an exclusive list of users you want to specifically exclude from scanning, but I needed to fix a few bugs, so that option is not yet available. Maybe for v3.04.
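
For readers curious about what storage autodetection can look like, here is a minimal Python sketch. It is not the code inside sa-trainer.cgi. It assumes the usual conventions that a Maildir is a directory containing cur, new and tmp subdirectories, while an mbox mailbox is a single regular file. The function name detect_mail_storage is made up for this illustration.

```python
import os


def detect_mail_storage(path):
    """Guess the storage type of a mailbox path.

    Assumption: a Maildir is a directory holding cur/, new/ and tmp/,
    and an mbox mailbox is one regular file. Anything else is 'unknown'.
    """
    if os.path.isdir(path):
        required = ("cur", "new", "tmp")
        if all(os.path.isdir(os.path.join(path, name)) for name in required):
            return "maildir"
        return "unknown"
    if os.path.isfile(path):
        return "mbox"
    return "unknown"
```

A caller could use the result to decide whether to pass a whole folder or a single mbox file to the training step.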

CHANGELOG between v3.02 and v3.03
- bug fix: how globalham@yourdomain.com is detected
- bug fix: how add-on domains are handled
- enabled the code for the callback to iandouglas.com to check if the version of the script you're using is the newest version
- added some code to detect if you've configured the script at all before trying to run the script
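
As a rough illustration of the version check mentioned in the changelog, the comparison step might look like the Python sketch below. How the real script fetches and compares versions is not documented in this message, so everything here is an assumption: version strings are treated as dot-separated numbers with an optional leading "v", and the helper names are hypothetical. No network call is made.

```python
def parse_version(text):
    """Turn a string such as 'v3.03' into a tuple of ints, e.g. (3, 3)."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def update_available(installed, latest):
    """Return True when 'latest' is a newer version than 'installed'."""
    return parse_version(latest) > parse_version(installed)
```

For example, update_available("v3.02", "3.03") is true, while comparing 3.03 against itself is false.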

To Make the v3 Script work the same as the last v2 version
- comment out the $global_ham_email line
- uncomment the $user_hambox line and set it to "myham"
- set $user_spambox to "myspam"

DISCLAIMERS:
Disclaimer #1: I wrote this script back in 2001, have been hacking at it ever since, and first posted it here in April 2004. It works amazingly well for me. Your mileage, of course, may vary.
Disclaimer #2: LunarPages has given me permission to post this information and quick start guide with the following notes:
Quote
please include a warning that it is the user's own responsibility to mess with it :)
and
Quote
(paraphrased) Please announce that all LunarPages users should consider this message thread as the primary source of support for sa-trainer.cgi
I fully intend to hang out here (since I'm also a LunarPages user myself) to support this script till the end of time, so I'm happy to comply.
Disclaimer #3: While I have tried very hard to document this as carefully as I can and use 'best practice' software development efforts, some errors are bound to happen, so there are NO guarantees on these instructions whatsoever. However, numerous LunarPages users use my script on a regular basis and have seen dramatic drops in the amount of spam in their Inbox.
Disclaimer #4: The new script (starting at v3.02) and full supporting documentation is located at iandouglas.com (linked later within the quick-start guide for the download link, and linked again at the bottom of this message). LunarPages support staff have been awesome about letting me move the script out of this message thread (since the script is too big to fit in a single message here now). Just be aware that viewing the full documentation and downloading the script itself will take you away from LunarPages and LunarForums.com. Please do return here for support if you are a LunarPages user -- I promised LP that I'd always be available to this forum for assistance.

THE NEW sa-trainer.cgi QUICK-START GUIDE
Here are some very general instructions for setting up SpamAssassin in CPanel, configuring the final details, downloading and installing the script, and getting it running. These instructions will teach you to do the following:
  • create an Email account called globalham@mydomain.com for your users to forward their non-spam messages to
  • enable the CPanel "spam box" option, and scan each individual user's spam mailbox

Assumptions I Need to Make about You
If you want to take the simplest approach, and use the default behavior of this script:
  • that you know how to log in to CPanel using your LunarPages or other hosting account details
  • that you know how to create a new mailbox for your primary domain in CPanel
  • that you can save a copy of my script on your local computer, change it in a text editor like Notepad or TextEdit (not a word processor like MS Word), and save the file
  • that you know how to use an FTP program to upload a copy of the script to your hosting account
  • if you have multiple users with mailboxes through your account, that you can communicate effectively with your users to clean up their own mailboxes once you've finished running this training script
If you want to use more advanced features of this script:
  • that you can do all of the above, and know how to search through the configuration settings within the script to make changes to suit your needs
  • if you download your Email through third-party software (Outlook, Outlook Express, Thunderbird, Eudora, etc.), that you are familiar enough with that software to add an IMAP account or profile
  • or, if you always use webmail such as Squirrelmail or Horde, that you are familiar enough with using the software to move or copy messages to other folders

Terminology You Need to Learn
SPAM: unsolicited Emails that you've received that want you to buy something or contain adult-themed references that you'd rather not get anymore.
HAM: non-spam, legitimate Emails from friends, family, newsletters, and so on
SA: short for SpamAssassin
False-Positive: this is a non-spam (HAM) message that SA flagged as SPAM and ended up in our spam box.
False-Negative: this is a SPAM message that SA flagged as non-spam (HAM) that ended up in our Inbox
IMAP: this is an Email protocol used to read and manage the Email messages stored on your hosting account (sending mail is handled by a different protocol, SMTP). Generally, IMAP leaves messages on the server, whereas a POP3 client typically downloads them to your computer and deletes the server's copy.
"spam/ham folder pair": this is a set of mail folders (which may actually be files instead of folders) that we will set up and use to store copies of messages to train SpamAssassin with.
primary domain: the first (or only) domain name configured for your CPanel account, not an add-on or parked domain added later.

NOTE: for all examples in the setup and the script itself, the account name I will use is myaccount. The primary domain for my CPanel account is mydomain.com. I'll do my best to keep these terms bolded throughout this text to highlight where you'll need to insert your own information.

Configure CPanel to turn SpamAssassin on

Log in to your CPanel interface, click on the 'Mail' icon, click on the link for 'SpamAssassin'.
Click on the 'Enable SpamAssassin' button, click on the 'go back' link.
Click on the 'Enable Spam Box' button, click on the 'go back' link.
Click on the 'go back' link again so you're back at the 'Mail' icon menu list where you clicked on 'SpamAssassin'
Click on 'Add/Remove/Manage Accounts'
Click on 'Add Account' link at the bottom
Set the Email account as 'globalham' at your primary domain name, set a password, and set a reasonable quota based on your usage, such as 100MB or 200MB. Click the 'Create' button
Click the 'Go Back' link

Create/Edit /home/myaccount/.spamassassin/user_prefs

In CPanel, click on the File Manager icon
Click on the folder icon next to the ".spamassassin" folder link
If "user_prefs" doesn't already exist, click on the "Create New File" link, call the file "user_prefs" and specify that it is a Text Document, and click the Create button.
Click on the filename link for "user_prefs", and in the top-right corner of the screen, select to edit the file.
Replace the entire contents of the file with this text:
Quote
use_bayes   1
required_hits   3.5
rewrite_subject   1
subject_tag   {SPAM _SCORE(0)_}
bayes_path   /home/myaccount/.spamassassin/bayes
bayes_file_mode   0600
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
... be sure to replace "myaccount" with your actual CPanel username, and click the 'Save' button
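
If you would rather generate the file than hand-edit it, the substitution step can be sketched in Python as below. This is only an illustration: the template text is copied from the quote above, and the function name render_user_prefs and the example username are made up.

```python
# Template copied from the quoted user_prefs contents above.
USER_PREFS_TEMPLATE = """\
use_bayes   1
required_hits   3.5
rewrite_subject   1
subject_tag   {SPAM _SCORE(0)_}
bayes_path   /home/myaccount/.spamassassin/bayes
bayes_file_mode   0600
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
"""


def render_user_prefs(cpanel_username):
    """Return the user_prefs text with 'myaccount' replaced by the real username."""
    if not cpanel_username:
        raise ValueError("a CPanel username is required")
    return USER_PREFS_TEMPLATE.replace("myaccount", cpanel_username)
```

Calling render_user_prefs("jdoe") yields the same ten lines with bayes_path pointing at /home/jdoe/.spamassassin/bayes, ready to paste into the File Manager editor.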

Getting the sa-trainer.cgi Script
Download the latest release copy of the sa-trainer.cgi script from my Downloads area: http://iandouglas.com/download.php?list.7
This will take you away from LunarForums.com, but is the preferred method for getting a stable copy of the script. On that page, click the blue 'down' arrow to start the download.

Configuring the Script
Edit the script in a text editor like Notepad or TextEdit, NOT in a word processor like MS Word.
Set $cpanel_username to your primary FTP login username for your CPanel account, instead of the default of "myaccount".
Set $my_domain value to be your primary domain name associated with your CPanel account, instead of the default "mydomain.com".
Save the script and close the text editor.

Upload sa-trainer.cgi to your hosting account
Using your favorite FTP program, upload the script in ASCII mode into your /www/cgi-bin/ folder, and set the permission bits (chmod) to 755. The script likely will not run without this.
*** I recommend renaming the script to something other than sa-trainer.cgi (like my-spam-trainer.cgi, or anything else with a .cgi file extension) so that people cannot easily tell you run this script, in case any exploitable bugs are found (though I haven't found any myself, nor have any been reported to me, in the past three years).

If you do not have an FTP program, you can open the script in Notepad or TextEdit again, copy the entire contents to your clipboard, and do the following:
in CPanel, click on the File Manager icon
click on the yellow folder beside "public_html" or "www" (they both go to the same place)
click on the yellow folder beside "cgi-bin"
click on the link to "create new file"
In the top right corner of the screen, specify to create a new text document called "sa-trainer.cgi" (or some other filename to avoid any security issues) and click the Create button
In the new window that pops up, paste the contents of the script into the space provided, and click the 'save' button at the bottom, then close the pop-up window
Back in the File Manager window, click on the filename (which is a link) for sa-trainer.cgi
Click on the link to set the permissions of the script, and select the 'execute' bit for all 3 columns so the permissions number reads '755' and click the 'change' button.
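
To make the '755' setting concrete: it means read/write/execute for the owner and read/execute for group and everyone else. The Python sketch below applies that mode to a file and reports the resulting permission string. It assumes a Unix-like system with shell access, which you may not have on shared hosting, and the function name is made up for this example.

```python
import os
import stat


def make_executable(path):
    """Set permissions to 755 (rwxr-xr-x) and return the resulting mode string."""
    os.chmod(path, 0o755)
    return stat.filemode(os.stat(path).st_mode)
```

On a regular file the returned string reads -rwxr-xr-x, which matches the three ticked 'execute' boxes in the File Manager.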

Have some recent spam/ham available to train with
Once you have some spam and ham messages available in the mailboxes you configured, simply call your script in your web browser, like "http://www.mydomain.com/cgi-bin/sa-trainer.cgi" (or whatever you called your copy of the script).

Ongoing maintenance
1. Teach your users to forward non-spam messages to globalham@mydomain.com, with a disclaimer that no human eyes will ever see the mailbox (you could be found liable for reading their private messages, so be sure you're not secretly peeking in there...). Instruct them not to forward messages over 100kb or with file attachments, as these can confuse SpamAssassin and slow down the scanning.

2. Once scanning is complete, empty the Inbox for the globalham@mydomain.com account. The easiest and quickest way to avoid any legal/privacy concerns is to completely delete the mailbox from CPanel and rebuild it.

3. You will also need to instruct your users to empty their spam boxes once scanning is complete. To do this, they can highlight/select all of their spam messages in the 'spam' folder, and use the delete function of their webmail/Email client software.
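
The forwarding rule in item 1 (nothing over 100kb, no file attachments) can be expressed as a small check. The Python sketch below is only an illustration of that rule, not part of sa-trainer.cgi. It reads "100kb" as 100 * 1024 bytes, which is an assumption, and the name ok_to_forward is hypothetical.

```python
import email

MAX_BYTES = 100 * 1024  # assumption: "100kb" read as 100 KiB


def ok_to_forward(raw_message):
    """Return True if a raw message (bytes) is small and has no attachments."""
    if len(raw_message) > MAX_BYTES:
        return False
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        # Parts marked 'Content-Disposition: attachment' count as attachments.
        if part.get_content_disposition() == "attachment":
            return False
    return True
```

A mail client filter or a small wrapper script could apply this before a message is forwarded to the globalham mailbox.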

Did I forget anything?
Be sure to notify me if I've neglected to describe any step along the way.

Full Documentation
A MUCH larger version of this documentation is available at http://iandouglas.com/page.php?3.0 (you will probably need an hour or more to read through it; told you it was huge), and it goes much deeper into the configurable options of the script.

And as stated in a few other places here: if you are a LunarPages customer, this forum message thread that you're reading right now is your primary means of support for this script so please post messages here if you have questions or problems with the script.

Like it? Love it? Need extra help?
I have placed a PayPal donation button on the opening page of the full documentation at iandouglas.com, in case you want to send a 'thank-you' to spend at Starbucks to keep me caffeinated enough to keep working on this script. There is also a second donation button if you want personalized support, where I talk to you by telephone to configure the script and install it on your account.

Good luck, and happy spam fighting!
« Last Edit: May 30, 2007, 05:33:29 PM by w98 »

Danielle
« Reply #1 on: April 08, 2004, 02:18:04 PM »

Hi Ian,

I changed the thread from a normal post to a sticky since I think this is great.  I may also place it in the how-to section, since again this is great.  Thumbs Up

Have a Blessed Day

w98
« Reply #2 on: April 08, 2004, 02:20:32 PM »

w00t, many thanks Thumbs Up

-id

Lopht
« Reply #3 on: April 09, 2004, 07:37:48 AM »

this rocks, thanks Ian.

Lopht
« Reply #4 on: April 09, 2004, 07:55:21 AM »

One question: the paths put into sa-learn.cgi reflect the account I use to log into CPanel, correct? Do I have to have entries in this file for each mail account in the domain?

print `$salearn -p /home/lopht01/.spamassassin/user_prefs --mbox --spam --showdots /home/lopht01/mail/domain/user/myspam`;

print `$salearn -p /home/lopht01/.spamassassin/user_prefs --mbox --ham --showdots /home/lopht01/mail/domain/user/myham`;

w98
« Reply #5 on: April 09, 2004, 10:19:36 AM »

Heya Lopht,

Yes, you have that set up right. Training SA on spam/ham for various other mailboxes that way would work just fine.
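
To show how that pattern extends to several mailboxes, here is a small Python sketch that only builds the sa-learn command lines; it does not run them. The path layout and the flags are taken from Lopht's two lines above. The function name, the default folder names and the example arguments are assumptions for illustration.

```python
def build_salearn_commands(account, domain, users, salearn="sa-learn"):
    """Build one spam and one ham sa-learn command per mailbox user.

    Assumed layout (from the lines quoted above):
    /home/<account>/mail/<domain>/<user>/myspam and .../myham
    """
    prefs = f"/home/{account}/.spamassassin/user_prefs"
    commands = []
    for user in users:
        base = f"/home/{account}/mail/{domain}/{user}"
        for flag, folder in (("--spam", "myspam"), ("--ham", "myham")):
            commands.append(
                [salearn, "-p", prefs, "--mbox", flag, "--showdots", f"{base}/{folder}"]
            )
    return commands
```

Each inner list could be handed to subprocess.run on a system where sa-learn is installed; every command uses the same user_prefs file, so all mailboxes train the same bayes database.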

As long as the user_prefs file points to your domain's bayes database files, it'll train everything into those databases. The down side with this, of course, is if one user thinks a message is SPAM while another thinks it's HAM, and they both try to train SA - SA will only remember the last way you trained it when it looked at a particular message.

So if user A and user B get a copy of message X from a newsletter, and user A trains it as SPAM and user B trains it as HAM, user A will continue to get the newsletter because user B re-trained SA using your overall domain bayesian database.
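
The "last training wins" behaviour described above can be pictured with a toy model in Python. This is not how SpamAssassin stores its data internally; it is just a dictionary keyed by a message identifier, with made-up names, to show why the most recent label is the one that sticks.

```python
def train(seen, message_id, label):
    """Record the latest label for a message, replacing any earlier one."""
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")
    seen[message_id] = label
    return seen


# User A trains newsletter X as spam, then user B trains it as ham.
shared_db = {}
train(shared_db, "newsletter-X", "spam")
train(shared_db, "newsletter-X", "ham")
```

After both calls shared_db holds "ham" for newsletter-X, so user A keeps receiving the newsletter.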

I don't know if we could set up SA for each individual Email account, that would take a lot of extra configuration, especially from LP's point of view, as well as using up a lot more of your disk quota for managing it on a per-user basis.

Just an extra $0.02. Thanks for the feedback.

-id

Lopht
« Reply #6 on: April 09, 2004, 10:38:32 AM »

Fortunately I am the only 'user' on my domain. Wink

One thing I didn't see in the howto, was to give the cgi script execute permission.

One oddity: when I first ran it with 4 messages in the myspam folder and none in the myham folder, it said it learned from 5 messages in myspam and 1 in myham. I'll assume that's just the "this is administrative data for this folder, do not delete" message that some readers don't display and some do?

Is there a programmatic way to also empty those folders after processing them?

w98
« Reply #7 on: April 09, 2004, 10:49:36 AM »

Yeah, you *could* set the Perl script to remove the messages in there with a library like Mail::Box, that's how I did it on my system after processing the messages. The only 'gotcha' with that is that it's been known to totally erase the mailbox once it's deleted all of the messages, so you'd have to possibly find another Perl library to recreate the IMAP folder if it gets deleted. Although, just doing a 'touch' system call should create it ... hmm (pondering)

And yes, the extra message it counted is the message about "administrative data, do not delete".
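
If you want to see the off-by-one for yourself, the Python sketch below counts the messages in an mbox file and separately counts those that look like the administrative placeholder. The subject text it matches is an assumption about what the mail server writes into the folder, so check your own mailbox and adjust it. The function name is made up.

```python
import mailbox

# Assumed subject of the administrative placeholder message.
INTERNAL_SUBJECT = "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"


def count_messages(path):
    """Return (total, real) message counts for an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        total = len(box)
        internal = sum(
            1 for msg in box
            if (msg["Subject"] or "").strip() == INTERNAL_SUBJECT
        )
    finally:
        box.close()
    return total, total - internal
```

With 4 real messages plus the placeholder this would report (5, 4), which matches the count Lopht saw.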

I had edited the main message yesterday to include the 755 permission notes, but it's not there this morning, so I've added it again. Embarassed

-id

Lopht
« Reply #8 on: April 09, 2004, 10:56:24 AM »

heh, one last thing, the example URL you give to run it says sa-train.cgi, but everywhere else it says sa-learn.cgi.

A quick and dirty way to "empty" the mailbox would be

system ("echo \"\" > /home/lpaccount/mail/myham");
system ("echo \"\" > /home/lpaccount/mail/myspam");

Or if the admin data is really needed, just copy/paste it into the perl script as a multi-line string and echo that. No real need to bring in a whole module. Wink

w98
« Reply #9 on: April 09, 2004, 10:58:28 AM »

Oops, my bad. I started out calling it "train.cgi" then changed to "sa-learn.cgi", I'll update that now.

I tried the system call to echo a blank line into the mailboxes, but then it adds an empty message into each mailbox... SpamAssassin ignores it though, so I went ahead and added the two lines at the end of the script above... and gave credit where it's due Thumbs Up

-id

w98
« Reply #10 on: April 09, 2004, 11:14:17 AM »

Hmm... having the system call redirect a blank line to the mail folders seems to work despite a 600 permission on the mailbox file itself.

I got Perl to open the file, write nothing into it, and then close it. By not using system("echo"), it doesn't write a blank message into the mailbox, effectively clearing it right out.
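
The same open-and-close trick, sketched in Python rather than Perl for illustration (the actual script is Perl, and this is not its code): opening a file for writing truncates it to zero bytes, whereas echoing an empty string leaves a single newline behind. The function name is made up.

```python
import os


def empty_mailbox(path):
    """Truncate an mbox file to zero bytes without deleting the file itself."""
    with open(path, "w"):
        pass  # opening in "w" mode truncates; nothing is written
    return os.path.getsize(path)
```

The file still exists afterwards, so the IMAP folder does not have to be recreated, and the returned size is 0.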

-id

w98
« Reply #11 on: April 09, 2004, 11:24:00 AM »

Okay, made a few edits to the script as a whole to use a $basepath variable and a $configfile variable, to make the script quicker for users to edit.

-id

w98
« Reply #12 on: April 10, 2004, 10:10:04 AM »

One more thing to note if you feel that SpamAssassin is no better after following this how-to:

Bayesian filtering won't "kick in" until SpamAssassin sees that you've trained at least 200 spam and 200 ham messages.
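
A quick way to reason about that threshold is sketched below in Python. The 200/200 figures come from this post; the function names are made up, and you would feed in whatever spam and ham counts your own setup reports (for example from sa-learn --dump magic, if you have shell access).

```python
MIN_SPAM = 200  # threshold stated in this thread
MIN_HAM = 200


def still_needed(nspam, nham):
    """Return how many more spam and ham messages must be trained."""
    return max(0, MIN_SPAM - nspam), max(0, MIN_HAM - nham)


def bayes_active(nspam, nham):
    """True once both thresholds have been reached."""
    return still_needed(nspam, nham) == (0, 0)
```

For example, still_needed(4, 0) returns (196, 200), so a mailbox with 4 trained spam messages and no ham is still far from the point where Bayesian scoring starts.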

I'm still trying to determine whether LunarPages even uses our individual bayesian databases when forwarding Emails for our accounts. They DO use our personal user_prefs file (although custom rules seem to be ignored), but there are still a few unknowns as yet.

-id

Lopht
« Reply #13 on: April 11, 2004, 09:19:43 PM »

I've noticed that since adding this, SA hasn't flagged a single message as spam. Is that what you meant with the last note? My bayes_toks file is almost 5 MB, so it is updating based on the spam/ham I'm giving it.

w98
« Reply #14 on: April 11, 2004, 09:31:29 PM »

Well, I'm waiting to hear back from Max now whether their server configurations tell SA to look at our personal bayes databases when delivering mail. But yes, that's what I meant. You have to train at least 200 ham and 200 spam for SA to kick in. Kind of a lot, I know, but WELL worth the effort once trained.

-id