Nutch is a nascent effort to implement an open-source web search engine.
Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing. Today's oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain. That would not be good for users of the internet.
Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.) All existing major search engines have proprietary ranking formulas, and will not explain why a given page ranks as it does. Additionally, some search engines determine which sites to index based on payments, rather than on the merits of the sites themselves. Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible.
Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine. This is a substantial challenge. To succeed, Nutch software must be able to:
- fetch several billion pages per month
- maintain an index of these pages
- search that index up to 1000 times per second
- provide very high quality search results
- operate at minimal cost
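To get a feeling for these numbers, here is a rough back-of-envelope sketch. The figure of 3 billion pages for "several billion" and the 30-day month are my own assumptions, not Nutch's, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages_per_month, days_per_month=30):
    """Average fetch rate needed to crawl pages_per_month pages in one month."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


def queries_per_day(queries_per_second):
    """Queries served per day if the given rate were sustained around the clock."""
    return queries_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumption: "several billion" is taken as 3 billion pages per month.
    print(f"fetch rate: {pages_per_second(3_000_000_000):.0f} pages/s")
    # The stated goal is "up to" 1000 searches per second, so this is an upper bound.
    print(f"query volume: {queries_per_day(1000):,} queries/day")
```

Under these assumptions the crawler has to sustain well over a thousand page fetches per second on average, around the clock, before any indexing or searching happens.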
Nutch is hosted by the Internet Archive and is backed by Mitchell Kapor (of Lotus and OSAF fame), Tim O'Reilly (of O'Reilly & Associates) and others.
Apart from the political arguments, I think that such an open source web search engine will successfully attract contributions because a web search engine is a tool that is used intensively and diversely by software engineers (hackers hack hack tools). And, of course, it is a very interesting technology to work on - but maybe I am a bit biased here.
It will be interesting to watch how they will decide what code to run on the cluster. Apart from new ranking algorithms, there could be a lot of web mining to support and/or supplement search. The open source community will come up with a lot of ideas.
As for hardware financing, this will be a challenge, but on the other hand the prices for the necessary hardware will continue to drop rapidly (up to the point where hardware is free). What will be costly is the maintenance, and most of it has to be done by full-time paid staff. But I guess that the Internet Archive is a good place to look for experience here.
What puzzles me is their current decision to implement it in Java. Maybe I am a bit old-fashioned here, but if your goal is to "operate at minimal cost" and scale to "several billion pages" and 1000 queries per second, then hardware costs will be a major factor. In the age of JIT, Java surely isn't several times slower than natively compiled languages. But optimizing things like L2 cache usage through the several layers of abstraction introduced by language features, the virtual machine and the JIT compiler will be extremely hard, yet ultimately necessary (a one percent performance gain might, if they are successful, save a couple of hundred servers). Of course, they will resort to implementing the crucial parts in C. But then they could have used a dynamic scripting language (one of the P-languages like Python, Perl or PHP) as glue instead and would have had advantages like faster development.
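The remark about a one percent gain saving hundreds of servers can be checked with a small sketch. The fleet sizes below are purely hypothetical; Nutch has not published any cluster size, and the function name is mine.

```python
def servers_saved(fleet_size, speedup_percent):
    """Servers no longer needed if every server handles speedup_percent more load.

    Both arguments are integers. The same total load is assumed before and
    after, and the reduced fleet is rounded up to whole machines.
    """
    # Ceiling division using integers only: ceil(fleet_size * 100 / (100 + p)).
    numerator = fleet_size * 100
    denominator = 100 + speedup_percent
    fleet_after = -(-numerator // denominator)
    return fleet_size - fleet_after


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show the order of magnitude.
    for fleet in (1_000, 10_000, 20_000):
        print(fleet, "servers ->", servers_saved(fleet, 1), "saved by a 1% gain")
```

So a one percent gain only reaches the "couple of hundred servers" range once the cluster has grown to roughly twenty thousand machines, which is the scale I have in mind for a successful web-wide search engine.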
A search engine is not a huge project in terms of number of lines of code. It is, however, quite dense in hard passages, with a lot of room for smart algorithms, data structures and highly optimized code. With a language you also get a culture, and while there are a lot of smart people in the Java community, their expertise lies in big projects and clever abstractions and modeling. In my experience, people interested in the guts of a search engine don't opt for Java in their own projects. But that might be just a bad prejudice of mine.
Interestingly, they seem to mandate unit tests for submitted code and I always wanted to see a non-trivial real test-driven project from inside, so this looks like a good opportunity to brush up my Java skills...
Ah, looking closer at the list of developers I spot Doug Cutting, 10+ year IR-veteran and Lucene author (which probably explains the choice of Java). And also Ben Lutch, co-founder of Excite. Exciting!
And with the Internet Archive nearby they will surely look into time-dependent link analysis. Many nice opportunities there!
Posted by seefeld at August 12, 2003 10:44
Just want to inform you about the new search engine "Objects Search"
http://www.ObjectsSearch.com