Web Hosting Forum | Lunarpages


Author Topic: Search Engines and my files  (Read 8903 times)
dkteddy
Trekkie

Posts: 13


« on: June 15, 2012, 07:22:00 AM »

I've been trying for a long time to keep search engines from scanning the documents I have on my website. I opened a ticket with Lunarpages and they suggested modifying my .htaccess file. I did modify it, but search engines can still read my files' content. For instance, a PDF file that I have secured is still accessible just by searching for a few lines of text from the PDF.

I would like to prevent that from happening. How can I do this? I also created an empty index.php and index.html so that users don't have access to the files in my upload folder, but the files are still searchable ...

I need help badly. Lunarpages suggested that I post my question here and let the community help me.

Thank you.
MrPhil
Senior Moderator
Berserker Poster

Posts: 5863



« Reply #1 on: June 15, 2012, 09:51:02 AM »

Are these documents readable from your pages, without user passwords? Once a search engine spider can see a link to a document, there's nothing to keep it from following that link and reading the document. You can set up a robots.txt file to tell search engines not to index your documents (files ending in .pdf), but that won't keep out "ill-mannered" ones like Baidu.

I can think of several ways to deny access to these files. One is to put the documents in a password-protected directory. Your members/users would have to be issued IDs and passwords to access that directory. They could still follow a link to a file, but would have to give their ID and password to actually download it. Another is to use some sort of access-control application (e.g., a forum such as SMF) to control access to the directory holding the documents, and "hash" the document names so that no one can guess them coming in from the outside. Search engines would not see the links, and only logged-in members could see them. Your application would have to present the original file name to the user and serve the document content, while still hiding the hashed name. If you can't trust your members not to redistribute IDs and passwords, then all of this is for naught.
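A minimal sketch of the password-protected-directory approach on an Apache host (the account path and realm name are placeholders -- adjust for your own layout):

```apache
# .htaccess inside the directory holding the documents
# (e.g. /public_html/documents/.htaccess)
AuthType Basic
AuthName "Members Only"
# keep the password file OUTSIDE the web root so it can't be downloaded
AuthUserFile /home/youraccount/.htpasswd
Require valid-user
```

Entries in the password file are created with Apache's htpasswd utility, e.g. `htpasswd -c /home/youraccount/.htpasswd someuser` for the first user (drop `-c` for additional users).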
-= From the ashes shall rise a sooty tern =-
dkteddy
Trekkie

Posts: 13


« Reply #2 on: June 19, 2012, 09:29:09 AM »

OK. I checked the uploads folder and the other folders that have my files, and they don't have a robots.txt file. I would like to start with that first.

Where should I put it, and what lines do I need to write?

Thank you.
MrPhil
Senior Moderator
Berserker Poster

Posts: 5863



« Reply #3 on: June 19, 2012, 11:49:12 AM »

You could start with this in /robots.txt:
Code:
# keep everyone out of my PDF files
User-agent: *
Disallow: /*.pdf$

I think this should work, although you may have to play with it a bit, as not all bots recognize the same syntax. Also remember that this is only advisory -- a bot is free to ignore this, and might even use it to find juicy stuff.

I would suggest googling for robots.txt to get some instruction on how to write robots.txt files.
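A stronger option than robots.txt alone, not mentioned above but standard on Apache hosts that have mod_headers enabled: send an X-Robots-Tag response header with the PDFs themselves, which tells compliant search engines not to index a file even when they reach it through a link. A sketch for the site's /.htaccess:

```apache
# Ask compliant crawlers not to index or follow links in any PDF they fetch
<IfModule mod_headers.c>
  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
  </FilesMatch>
</IfModule>
```

Unlike robots.txt, this directive travels with the file itself, so it also covers PDFs that crawlers discover through direct links from other sites.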
jany367
Spaceship Navigator

Posts: 79


« Reply #4 on: June 22, 2012, 05:50:46 AM »

The robots.txt file is not always reliable. In many cases a search engine has already crawled a file before you add a rule for it. In your case I would suggest first adding the code given by MrPhil to your robots.txt file, and then requesting removal of the file through the search engine's webmaster tools so it won't be crawled again.

dkteddy
Trekkie

Posts: 13


« Reply #5 on: July 09, 2012, 10:28:02 AM »

Well, I have tried many things, and the search engines can still browse my PDF files. I blocked PDF files, documents, spreadsheets, etc., but search engines are still able to scan the contents of the PDFs. The robots.txt file and the .htaccess file seem to do nothing when it comes to all of this.

Any free tools that I can use to protect my files?
MrPhil
Senior Moderator
Berserker Poster

Posts: 5863



« Reply #6 on: July 09, 2012, 02:38:09 PM »

robots.txt ought to be able to block all "well behaved" search engines (those that obey robots.txt). You may have to do some research and trial and error. Is your robots.txt file readable (644 or 444 permissions)? Is it in your site root (/)?
dkteddy
Trekkie

Posts: 13


« Reply #7 on: July 10, 2012, 07:40:36 AM »

The robots.txt file is in the public_html folder and it has 644 permissions.
MrPhil
Senior Moderator
Berserker Poster

Posts: 5863



« Reply #8 on: July 10, 2012, 08:57:26 AM »

Are the major search engines still cataloging your PDF files? This would include Google, Yahoo, Bing, etc. They have a reputation as being "well behaved". With the robots.txt, are they indexing new PDF files, or has that stopped, and only the older ones are still cataloged? If new PDFs are being indexed, it sounds like the robots.txt is not set up correctly. If just the old ones are still indexed, you may have to wait until the next crawl of your site, or even go into the various "webmaster" tools for a search engine and ask for the PDF files to be removed. Once you know you have the right recipe in robots.txt, I suppose you could always change all the existing PDF file names (and change all your references to them), so that the indexed PDFs no longer exist and will be removed by the search engines.
Jason Martin
Spaceship Navigator

Posts: 85


« Reply #9 on: August 02, 2012, 10:14:21 AM »

Check your robots.txt file carefully.

Miketys0n
Trekkie

Posts: 16


« Reply #10 on: October 11, 2012, 11:45:49 PM »

You need to make changes in robots.txt: put a Disallow rule on the links that you don't want to show in search engines. It will tell the search engines not to index your documents.
Thanks
MrPhil
Senior Moderator
Berserker Poster
*****
Offline Offline

Posts: 5863



« Reply #11 on: October 12, 2012, 04:46:59 AM »

Note that search engines follow robots.txt's directives voluntarily. There's nothing to keep a "badly behaved" search engine from indexing things you don't want indexed. They may even deliberately look at listed items on the chance there's something valuable there. robots.txt cannot be used as a security mechanism to hide files.
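Since robots.txt is only advisory, one partial countermeasure (not raised in the thread, and only a sketch -- the bot name is an example, and User-Agent strings are easily spoofed, so treat this as a nuisance filter, not security) is to deny known misbehaving crawlers by User-Agent in the site's /.htaccess:

```apache
# Flag requests whose User-Agent matches a crawler that ignores robots.txt,
# then refuse them (Apache 2.2-style access control)
BrowserMatchNoCase "Baiduspider" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```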
sirghayoor
Space Explorer

Posts: 6


« Reply #12 on: December 09, 2012, 02:32:06 AM »

Thanks. These are two different things the site gives us; search engines are meant to find the sites that are working in the market.
Logged