Jay Allen (of MT-Blacklist fame) announced a new project to fight blog spam. And Mark Pilgrim warns him to be careful about whom he picks a fight with.
I agree with Mark's points, but I think he (both of them, actually) misses an important difference between other forms of spam and this comment spam: The primary targets here are not the hundreds of thousands of blogs and their millions of readers. The targets are a very few companies: search engines that use link analysis to rank results, foremost Google.
The goal of the spammers is to generate a high pagerank with good anchor texts for their sites in a short time, often from scratch, as they will probably have to abandon their domains just as quickly. And as Mark explains, the people behind the spam are profit-oriented businesses. If the method fundamentally stops working, they will stop spending time and money on it.
And the primary targets, e.g. Google, are in a good position to detect them even more quickly: Most comment systems follow similar conventions, especially for dating comments. Given a deep crawl of the web, it is possible to count the number of comments a particular blog owner made on any given day. Set a threshold at something humanly possible and ignore any links above that level. (Do the same for other links mentioned in the comments; correlate them with the emergence of identical links outside of blog comments.) If spammers start linking to subpages or subdomains instead of homepages, treat those the same way (and whitelist blog hosting providers). Using many domains will quickly become too costly, and if it does not, limit the total effect of indirect incoming googlejuice per day.
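A minimal sketch of the counting idea, in Python. The threshold of 50, the function names and the (author URL, day) input format are assumptions for illustration only; nothing here is known to be what Google or any crawler actually does.

```python
from collections import Counter
from datetime import date

# Assumed value: roughly how many comments one person could write in a day.
MAX_HUMAN_COMMENTS_PER_DAY = 50


def count_per_author_and_day(comments):
    """Count comments per (author_url, day).

    `comments` is an iterable of (author_url, day) tuples, one per
    comment found in the crawl (a hypothetical input format).
    """
    return Counter(comments)


def credited_links(comments, threshold=MAX_HUMAN_COMMENTS_PER_DAY):
    """Cap the links credited to each author per day at the threshold,
    ignoring everything above it."""
    counts = count_per_author_and_day(comments)
    return {key: min(n, threshold) for key, n in counts.items()}


def suspected_spammers(comments, threshold=MAX_HUMAN_COMMENTS_PER_DAY):
    """Return the author URLs that exceeded the threshold on any day."""
    counts = count_per_author_and_day(comments)
    return {url for (url, _day), n in counts.items() if n > threshold}


if __name__ == "__main__":
    day = date(2003, 11, 16)
    crawl = [("http://spam.example/", day)] * 200
    crawl += [("http://human.example/", day)] * 3
    print(credited_links(crawl))
    print(suspected_spammers(crawl))
```

Capping rather than zeroing the count follows the "ignore anything above that level" wording; a stricter variant could drop all links from an author who crosses the threshold.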
Others have proposed disabling the pagerank effect by redirecting all outgoing links through a script hidden by robots.txt. I fear that, unless this idea gains immense popularity, this won't help: it is still easier to just spam than to detect beforehand that the spamming attempt is futile. And it breaks an important part of the web.
I applaud Jay Allen's initiative, but I think the most powerful defender of the blogosphere's comment sections is the ultimate target of the spam attacks: Google. (And it would probably be the first antispam measure that becomes more effective by being publicly announced!)
In any case, the idea of keeping a worldwide daily comment count might also be useful to populate the blacklist.
Update: By no means do I suggest that the blog community should sit back and wait for Google to fix the problem. But I do think that the problem would be most efficiently fixed at Google, if only because all other anti-spam solutions would be limited to blogs that install extra software.
Indeed, I think that the idea of global daily comment counts should be considered within the blam project. By the nature of the attack, spammers will reveal themselves through an inhumanly high number of comments in the blogosphere.
An additional way to find spammers is to look for identical or near-identical comments (again, works for Google and blam).
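One way this could look, as a rough Python sketch: compare comments by their overlapping three-word sequences (shingles) and flag pairs with a high Jaccard similarity. The technique, the shingle size, the 0.7 cut-off and all function names are assumptions chosen for illustration, not something blam or Google is known to use.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences in `text`, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short comments become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(comments, min_similarity=0.7):
    """Return index pairs (i, j), i < j, of comments that look alike.

    Compares every pair, so this is only suitable for small batches.
    """
    sets = [shingles(c) for c in comments]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= min_similarity:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    batch = [
        "Visit our great site for cheap pills and fast delivery today",
        "Visit our great site for cheap pills and fast delivery now",
        "I disagree with the second paragraph of your post",
    ]
    print(near_duplicates(batch))
```

The pairwise comparison is quadratic in the number of comments; at crawl scale something like hashing the shingles into buckets first would be needed, which this sketch leaves out.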
A nice side-effect of the project would be a technorati for comment space! Indeed, that would allow me to aggregate all the comments I have spread over the web back into my blog. If blam decides against implementing something like this because it is beyond scope, then I forward this request to the oh, dear LazyWeb.
Posted by seefeld at November 16, 2003 23:42
I agree with what you said: blog comment spam is different from e-mail spam in that the main purpose is just "page rank pushing". Google would be the perfect place to fight against this tactic, but... Google doesn't seem to be interested at all.
Maybe I'm wrong, but gaming google's page rank isn't an idea blog spammers came up with. It's far older than blog spam, and is used excessively by link farms and other crap. This stuff is well known, and what is google's reaction? Not very much. I've seen some link-farms removed from google's database, and within minutes ten new ones came up. It's like in politics: they fight the symptoms, not the source of the problem itself.
Personally, I won't wait for google to take action against pagerank gaming. I think blog spam must be stopped now, and we need working solutions without having the blogging community reinvent the wheel over and over again. And this is where blam comes in: by specifying data structures and protocols that can be used with virtually every blogging solution, and by providing a reference implementation of a blacklist management tool that cuts the time needed for useful blacklist support. Time will tell if this goal can be reached or not.
The success of blam will depend on how the community supports this project. So feel free to join the developers mailing list or drop in to the IRC channel for an on-the-fly exchange of ideas. More information is on the project's website in the "join us" section.
Bye, Mike
Posted by: Michael Renzmann at November 17, 2003 06:50 AM
Mike, I updated the entry to address your concerns.
OTOH, I don't think that Google ignores the spam problem per se. I just think that they concentrate on automatic solutions instead of manually deleting linkfarms et al. This is much harder to do, but scales better. We will see how this works out in the long term. But as my post outlines, comment spam should be easy for Google to detect automatically, so I still have my hopes there...
Posted by: Bernhard Seefeld at November 17, 2003 11:04 AM
How would google tell the difference between the link which is for the spammer's ID, and a link within the comment itself to a legitimate site?
For example, while the link behind my name should be throttled to human limits, if I happen to include a link to whatever is on top of today's blogdex then that link should not be throttled.
This is a difference that the blogs themselves can tell (being at the input end), but google might have a much harder time doing so. Unfortunately, in either case, spammers only need to put their nefarious links into the comment text itself, and then sign it with some random or anonymous name and link.
Perhaps one solution remains though with comment-registration services like TypeKey ... if they could provide an API to get the daily-comment-count (whilst preserving privacy of actually *where* those comments were made) ... then spammers would need to register many fakes to hide behind, and the more they spam then the more fakes they need to register. Introduce some hurdle in the registration and we might have a chance.
Posted by: eric scheid at May 13, 2004 11:48 AM
Eric, I think you are correct that it would be hard (but not impossible) to distinguish the currently popular legitimate links from the spammed links. On the other hand, the damage from throttling these too much is maybe not such a big problem after all: it will only limit the (lasting!) effect of the one-time peak popularity.
I too think that the true reason there is such spam is Google's slightly outdated way of determining relevance.
There are a few quick fixes (e.g., ignoring any comment links), but of course they do not scale too well. The algorithms just have to become better.
Posted by: mirko at June 11, 2004 06:24 AM