Watch Out, Google
Nutch could rewrite the rules of search development -- especially with an impressive roster of Internet luminaries now lining up behind it.
Ask anyone in Silicon Valley what the hottest application on the Internet is today and you can bet their answer will be search. The dealmaking has been nothing short of torrid. Only a year ago there were at least half a dozen major players. Now there are just three: Yahoo (YHOO), which last month bought search giant Overture (OVER) in a $1.6 billion deal; Google, the undisputed king of search; and Microsoft (MSFT), which is busy building a search platform of its own. They're all fighting to dominate the huge and ballooning market, already worth $2 billion and expected to generate between $6 billion and $8 billion in revenues by 2007.
Microsoft goes gunning for Google
HELEN JUNG; The Associated Press
REDMOND - Microsoft Corp. may be the most recognized software company on the planet, but when it comes to searching the Internet, people are much more likely to "Google it."
Microsoft wants to change that, and it's betting millions that someday it will be as well known for searching as Google. The company's push comes amid an exponential growth in information - on desktop computers, in online photo albums, on Web sites.
"And the more information there is out there, the more difficult it becomes to find relevant information and content," said Rob Lancaster, a senior analyst with the Boston-based Yankee Group research company. "The information glut, as it is popularly known, is becoming a real problem for lots of businesses."
Beefing up its search power is a smart move for Microsoft, Lancaster said, and should strike some fear in Google, Yahoo! and other companies that offer search engines.
It won't be easy to shove those two aside, however, said Danny Sullivan, editor of the Search Engine Watch online newsletter, noting their loyal followings.
And the field is getting even more crowded as companies realize the multibillion-dollar market for searching - and search-related advertising. IBM Corp. last week announced its search engine, WebFountain, which is designed to not only find text online but also to analyze its meaning.
Still, Microsoft has a strong position as one of the top three search sites already on the Web. "Unless they make some terrible mistake, they're going to continue to be a very strong player," Sullivan said. "If they've decided it's important and they want to grind away at trying to solve the problem, they have a good track record of putting together good software to do that sort of thing."
Microsoft has its eyes set beyond mapping the World Wide Web.
It is developing search-related technologies to do everything from sorting through digital photos to combing through items scattered across desktop computers, in an effort to answer an Information Age-old problem: How do you find what you're looking for?
"Information management is a really important problem," said Susan Dumais, a senior researcher for Microsoft Research, who is developing a tool for rapidly finding material that users have seen - regardless of whether it was an e-mail, Web site or document.
Some of Microsoft's efforts to simplify searching on the Internet will soon be in place. The new version of Microsoft's MSN Internet service, available this winter, will include a tool for retrieving digital photos based on faces or similar backgrounds. For example, users can ask their computers to retrieve all pictures that include a specific person's face.
But many are watching most closely the company's project to develop its own indexing and searching system for the Internet - and how the technology might later be deployed throughout the company.
Analysts estimate that Microsoft, which has long relied on outside companies to provide the search tool on its MSN Web site, is spending millions of dollars on developing its new search engine. Microsoft itself won't comment on how much it is spending, how many people it is devoting to the project or possible acquisitions.
MSN decided several months ago it was time to create its own search technology instead of relying on search companies Inktomi and Overture, said Kirk Koenigsbauer, general manager of MSN.com. He said it was unrelated to moves by Yahoo! Inc., dating back to December 2002, to acquire Inktomi, and more recently, Overture.
Rather, Microsoft saw how important searching has become, Koenigsbauer said, and contends that no one is really doing a good job answering those queries.
Although many users do find what they are looking for, there are numerous ways that all search engines can better sort through the mass of Web sites to hone their results, said Charlene Li of Forrester Research.
That gives Microsoft an "in" to displace Google and Yahoo!, she said. If Microsoft can build a better search engine, "it's wide open at this point."
Koenigsbauer would not say when Microsoft's new search tool will appear, or what technical changes Microsoft is making to improve search.
"That's the secret sauce," he said.
But he said better personalization is one way to improve searching. For example, if MSN knows that the computer user searching for "pizza" lives in a specific ZIP code, it can deliver results of pizza places in that ZIP code.
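The ZIP code example can be illustrated with a small Python sketch. This is not Microsoft's method, which the article says is undisclosed; the `Listing` fields, the relevance scores, and the sample ZIP codes are all hypothetical. The sketch simply ranks listings in the user's ZIP code ahead of the rest, then orders each group by a made-up relevance score.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    name: str
    zip_code: str
    relevance: float  # hypothetical base score for the query "pizza"


def personalize(listings, user_zip):
    """Put listings in the user's ZIP code first, then sort by relevance."""
    # False sorts before True, so local listings come first;
    # negating relevance sorts higher scores ahead of lower ones.
    return sorted(listings, key=lambda l: (l.zip_code != user_zip, -l.relevance))


# Illustrative data only; none of these values come from the article.
SAMPLE = [
    Listing("Pizza place A", "98052", 0.7),
    Listing("Pizza place B", "94103", 0.9),
    Listing("Pizza place C", "98052", 0.8),
]

ranked_names = [l.name for l in personalize(SAMPLE, "98052")]
```

Prioritizing local results rather than filtering out everything else is a design choice made here for illustration; a real engine could just as plausibly drop non-local listings entirely.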
Spokespeople for Google and Yahoo! recently said no one from their companies would be immediately available by phone or e-mail to comment on the potential for competition in search technology from Microsoft.
Beyond satisfying consumers, better searching can be lucrative.
Companies pay or bid for inclusion in a search site's listings - typically in a cordoned off section for advertisers - based on the keywords the user types in. For example, a company that sells shoes might pay to be included on queries for "Manolo Blahnik sandals."
Such paid listings are expected to generate more than $2 billion in revenue for search sites in 2003, Forrester Research's Li said.
But the technology may help Microsoft focus on searching in ways that go deeper than the Web.
Company researchers, including Dumais, are studying how people narrow down searches for documents they've seen before and want to retrieve - using special dates as a memory cue or the sender of the document as an identifying characteristic.
Others - led by Gordon Bell in Microsoft Research's lab in San Francisco - are looking at how to build what amounts to a computer backup for people's memories. Bell has turned phone calls, bills, pictures, music and other personal effects into digital files stored on a computer hard drive with a search tool to sort through the mass of information.
Although Microsoft has not revealed many details about its new Longhorn operating system, the company has said it plans to build a unified file system that allows a quick search across everything in a computer, regardless of whether it is an e-mail or other specialized document.
Electronic data is only going to grow, said Dumais. "If you have to struggle through looking for things in hundreds of different places, it's just going to be intolerable," she said.
http://www.tribnet.com/business/story/3986857p-4008445c.html
The modern publishing industry can be said to have started in 1455 when Gutenberg's first Bible was printed. Since that day publishers have been looking for cheaper ways to get things done, and today they are clearly interested in Open Source. At the Seybold publishing conference in San Francisco last week, we were surprised to find Open Source alive and well in four of the first six booths we visited.
Publishing -- by firms that produce newspapers, magazines, books of all kinds, and even corporate documents -- is a very well understood business where the leaders are firms who have cut costs to the absolute minimum and exist on very thin margins, thanks to intense competition from other publishers and other media, including, nowadays, the Internet.
The print publishing industry has pretty much standardized on a handful of creation tools. While there are good Open Source tools like The GIMP that have a lot to offer, most people in publishing won't consider changing their current tools and platform. They have too much invested in training, and most print production jobs advertise for specific skills in applications like QuarkXPress, a page layout package, Adobe Photoshop, an image editing application, and Adobe Illustrator for vector graphic manipulation.
Print people are also often too busy to learn new software -- modern publishing workers are expected to be very productive. They have automated lots of their daily chores using scripting, but it's not often Perl: an amazing amount of this work is done on Macintoshes, and AppleScript is probably the most-used automation tool. Chicago-based R.R. Donnelley, one of the world's biggest printers, has literally millions of lines of AppleScript, and a coding department to maintain them.
Two of the growth areas in publishing technology are digital asset management and production automation. Publishers are eager to wring every last dollar out of the materials they create and it's usually much cheaper to create new products out of work that's already on hand than to start from scratch. An example would be a magazine company repackaging related articles into a special edition. One problem is finding the material -- large companies like AOL Time Warner have literally billions of files on hand, often spread around plants and servers all over the world. Another problem is automating the repurposing process -- it can be just as time-consuming (read "expensive") for human operators to reformat content by hand from, say, print to HTML as it is to create the stuff in the first place.
To offer help with digital asset management, Google was at Seybold this week pitching its Search Appliance, a bright yellow, 1U rack-mount server that is designed to be dropped into a file server farm, where it will handle the chore of indexing everything it can find. Google publicist Nathan Tyler says the machine runs a custom version of Linux, derived from Red Hat, and optimized to run Google's proprietary search algorithms.
So much for finding stuff, although one wonders if Nutch, the recently announced Open Source search engine project, may result in code that could be adapted to make custom search appliances that run on commodity hardware. There could be an opportunity for developers to install and customize these appliances for large media companies or to develop their own products, possibly at lower cost than Google and other proprietary search vendors.
In the area of production automation, two companies that use Open Source to make it easier to repurpose content are Exegenix and Innovation Gate. Both companies build proprietary XML products on Open Source technologies like Tomcat, Apache, and Linux. XML, by the way, has been a kind of Holy Grail in print publishing for years. Publishers know that if they can tag content for what it is -- headline, byline, body text, etc. -- it becomes a lot easier to automate the repurposing process. Unfortunately, the products they have used for more than a decade only relatively recently added capabilities to tag content for what it is -- e.g. "body copy" -- rather than for what it should look like -- e.g. "12-point Times Roman."
Hand conversion of giant archives is regarded as impossibly expensive. So Exegenix and Innovation Gate offer products that take files in a wide variety of input formats, plug them into XML as best they can, then display the results to operators who can catch and fix problems. Once the files are in XML, operators can set up templates that automate the creation of HTML pages, PDF files, and new print pages much more quickly than formatting each by hand. In the case of the similar classes of documents found at insurance companies and financial services firms, the time savings can be huge. Innovation Gate, which runs on any platform that supports J2EE, and Exegenix, which runs natively on Win32 and Linux, both have rosters of corporate clients who use their products to reduce costs and speed work flow.
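To make the idea of tagging content "for what it is" concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It does not reflect how Exegenix or Innovation Gate work internally; the tag names (`headline`, `byline`, `body`), the placeholder text, and the tag-to-HTML mapping are assumptions chosen for illustration. It shows how one small template can turn semantically tagged content into HTML without hand formatting.

```python
import xml.etree.ElementTree as ET
from html import escape

# Content tagged for what it is, not for how it should look.
# The text is placeholder content, not taken from any real article.
ARTICLE_XML = """<article>
  <headline>Sample headline</headline>
  <byline>Sample byline</byline>
  <body>Sample body text.</body>
</article>"""

# Hypothetical "template": which HTML element each semantic tag becomes.
HTML_TEMPLATE = {"headline": "h1", "byline": "address", "body": "p"}


def to_html(xml_text, template=HTML_TEMPLATE):
    """Render each known semantic element as its mapped HTML element."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        html_tag = template.get(child.tag)
        if html_tag is None:
            continue  # skip tags the template does not know about
        parts.append(f"<{html_tag}>{escape(child.text or '')}</{html_tag}>")
    return "\n".join(parts)


html_output = to_html(ARTICLE_XML)
```

Swapping in a different mapping would, in principle, retarget the same tagged content to another output format, which is the appeal of semantic tagging for repurposing.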
Artifex is one of the oldest successful publishing-oriented Open Source developer shops. According to CTO Raph Levien, Artifex has been developing products based on Ghostscript since 1988, and claims more than 80 OEM customers, including IBM, HP, Macromedia, and Xerox. Artifex offers a very complete set of graphics libraries that let other products -- everything from software to ink jet printers -- render pages from PostScript, PDF, PCL, and other formats. The core company has 10 people, and often taps contractors and volunteers from legions of Ghostscript developers all over the world for specific projects.
There are more opportunities: the raster-image processors used by printers and plate-makers have traditionally been expensive, proprietary software running on proprietary Unices like Irix. No one has yet managed to build a general purpose publishing workflow system that has attracted more than a point or two of market share, even though the publishing industry is standardizing on an XML format called Job Definition Format (JDF). Proprietary systems have in the past been too expensive and inflexible, and publishers have been burned by vendor lock-in strategies. There may well be opportunities for MySQL and PostgreSQL developers to learn how to apply their skills to workflow systems.
Open Source developers can also find lots of niche opportunities -- publishing is a huge and varied field, and these customers will listen to developers who can save them money. Where Gutenberg failed, an Open Source developer may well succeed.
Chris Gulker, a Silicon Valley-based freelance technology writer, has authored more than 130 articles and columns since 1998. He shares an office with 7 computers that mostly work, an Australian Shepherd, and a small gray cat with an attitude.
http://newsforge.com/article.pl?sid=03/09/12/1544240&mode=thread&tid=3
Nutch is a nascent effort to implement an open-source web search engine.
Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing. Today's oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain. That would not be good for users of the internet.
Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.) All existing major search engines have proprietary ranking formulas, and will not explain why a given page ranks as it does. Additionally, some search engines determine which sites to index based on payments, rather than on the merits of the sites themselves. Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible.
Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine. This is a substantial challenge. To succeed, Nutch software must be able to:
fetch several billion pages per month
maintain an index of these pages
search that index up to 1000 times per second
provide very high quality search results
operate at minimal cost
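A rough back-of-the-envelope calculation gives a sense of scale for the targets above. The Python sketch below assumes "several billion" means 3 billion pages and a 30-day month; both figures are assumptions for illustration, not numbers from the Nutch project.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_second(pages_per_month, days_per_month=30):
    """Average fetch rate needed to crawl the given volume in one month."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


def queries_per_day(queries_per_second):
    """Daily query volume if the given rate were sustained around the clock."""
    return queries_per_second * SECONDS_PER_DAY


# Assumption: "several billion" taken as 3 billion for illustration.
ASSUMED_PAGES_PER_MONTH = 3_000_000_000

fetch_rate = pages_per_second(ASSUMED_PAGES_PER_MONTH)  # about 1,157 pages/s
peak_daily_queries = queries_per_day(1000)  # 86,400,000 if sustained all day
```

Under these assumptions the crawler would need to average more than a thousand page fetches per second, and because 1000 searches per second is stated as an upper limit, the daily query figure is a ceiling rather than an expected load.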
This is a challenging proposition. If you believe in the merits of this project, please help out, either as a developer or with a donation.
http://www.nutch.org/docs/en/
© The Nutch Organization, 2003
For more information:
http://www.business2.com/articles/mag/0,16...