Engineering a large-scale web search engine is technically difficult

Engineering a search engine is a challenging
task. The web creates new challenges for information retrieval. Automated
search engines that rely on keyword matching usually return too many
low-quality matches. We chose the name Google because it fits
well with our goal of building a very large-scale search engine. The goal of
our system is to address many of these problems, in both quality and
scalability. Fast crawling technology is needed to gather the web documents
and keep them up to date. Storage space must be used efficiently. Queries must
be handled quickly, at a rate of hundreds to thousands per second. These tasks
become more difficult as the web grows. Some people believe that a complete
search index would make it possible to find anything easily. Aside from
tremendous growth, the web has also become increasingly commercial over time.
One of our main goals in designing Google was to set up an environment where
other researchers can come in quickly.
Google utilizes links to improve search results. PageRank can be thought of as
a model of user behavior. The text of a link is treated in a special way in our
search engine. Most search engines associate the text of a link only with the
page that the link is on; we also associate it with the page the link points
to. We use anchor propagation mostly because anchor text can help provide
better-quality results. Using anchor text efficiently is technically difficult
because of the large amount of data that must be processed. In our current
crawl of 24 million pages, we had over 259 million anchors which we indexed.
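The idea of PageRank as a model of user behavior, a "random surfer" who follows links and occasionally jumps to a random page, can be sketched as a short power iteration. This is a minimal illustration, not Google's implementation; the damping factor of 0.85 and the toy link graph are assumptions.

```python
# Minimal PageRank power iteration over a toy link graph.
# The damping factor d models the chance that the random surfer
# follows a link rather than jumping to a random page.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share from random jumps.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank evenly to the pages it links to.
                share = d * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                for p in pages:
                    new_rank[p] += d * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here page C ends up ranked above B because it is linked from both A and B, which is exactly the intuition the essay describes: rank flows along links.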
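Anchor propagation, indexing the words of a link's text under the page the link points to rather than the page it appears on, can be sketched as follows. This is a hypothetical illustration; the data shapes and URLs are assumptions, and a real system would also record position and font information.

```python
# Sketch of anchor-text propagation: words in a link's anchor text are
# indexed under the *target* URL, not the page the link appears on.

from collections import defaultdict

def build_anchor_index(links):
    """links is a list of (source_url, target_url, anchor_text) triples."""
    index = defaultdict(set)  # word -> set of target URLs
    for source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index

links = [
    ("a.example", "searchengine.example", "great search engine"),
    ("b.example", "searchengine.example", "fast web search"),
]
index = build_anchor_index(links)
```

The payoff is that a query word can now retrieve `searchengine.example` even if that page never contains the word itself, which is why anchors provide better-quality results despite the volume of data involved.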
Aside from PageRank, Google keeps location information for all hits, which
allows extensive use of proximity in search. Words in a larger or bolder font
are weighted more heavily. If a user issues a query like “Bill Clinton”, they
should get reasonable results, since there is an enormous amount of
high-quality information available on this topic. We believe standard
information retrieval work needs to be extended to deal effectively with the
web. The web is a vast collection of completely uncontrolled documents.
Another difference between the web and traditional well-controlled collections
is that there is virtually no control over what people can put on the
web.

Most of Google is implemented in C or C++ for efficiency and can run on either
Solaris or Linux. In Google, web crawling is done by several distributed web
crawlers.

Google's data structures are optimized so that a large collection of documents
can be crawled, indexed, and searched at little cost. Google is designed to
avoid disk seeks whenever possible, and this has had a considerable influence
on the design of its data structures. BigFiles also support rudimentary
compression options. The repository contains the full HTML of every web page.
The choice of compression technique is a tradeoff between speed and
compression ratio. We can rebuild all the other data structures from the
repository and a file which lists crawler errors. The document index keeps
information about each document and makes it possible to fetch a record in one
disk seek during a search. The lexicon fits in memory for a reasonable price.
A hit list corresponds to the occurrences of a
particular word in a particular document, including position, font, and
capitalization information.

Running a web crawler is a challenging task. There are reliability and
performance issues and, even more importantly, there are social issues. In
order to scale to hundreds of millions of web pages, Google has a fast
distributed crawling system. It turns out that running a crawler which
connects to more than half a million servers generates a fair amount of phone
calls and email, because of the vast number of people coming online. There are
always those who do not know what a crawler is, because this is the first one
they have seen.

The goal of searching is to provide quality
search results efficiently. Many search engines seem to have made good
progress in terms of efficiency, so we have focused our research on search
quality. Google maintains much more information about web documents than
typical search engines. Combining all of this information into a rank is
difficult. We have designed our ranking system so that no particular factor
can have too much influence. A single-word query is the simplest case: in
order to rank a document for a single-word query, Google looks at the hit list
for that word.
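Single-word ranking from a hit list can be sketched as below. The type weights, the cap on counts, and the way the score is combined with PageRank are all illustrative assumptions, not the system's actual parameters.

```python
# Sketch of single-word ranking: each hit carries a type (title, anchor,
# plain text, ...); counts of each type are capped, weighted, and summed,
# then combined with the document's PageRank.
# The weights, the cap, and the combination rule are hypothetical values.

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "plain": 1.0}
COUNT_CAP = 5  # beyond this, extra hits of a type stop adding score

def ir_score(hit_list):
    """hit_list is a list of hit types for one word in one document."""
    counts = {}
    for hit_type in hit_list:
        counts[hit_type] = counts.get(hit_type, 0) + 1
    return sum(TYPE_WEIGHTS[t] * min(c, COUNT_CAP) for t, c in counts.items())

def final_rank(hit_list, pagerank):
    # Combine the IR score with PageRank; a real system would use a
    # more carefully tuned combination.
    return ir_score(hit_list) * pagerank

score = final_rank(["title", "anchor", "plain", "plain"], pagerank=0.7)
```

Capping the per-type counts is one way to realize the design goal stated above: no single factor, such as a word repeated hundreds of times in plain text, can have too much influence on the final rank.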
For a multi-word search the situation is more complicated. The ranking system
has many parameters, and figuring out the right values for these parameters is
something of a black art. In order to do this, we have a user feedback
mechanism in the search engine: a trusted user may optionally evaluate all of
the results that are returned, and this feedback is saved.

The most important measure of a search engine is the quality of its search
results. Our own experience with Google has shown it to produce better results
than the major commercial search engines for most searches. Google relied on
anchor text to determine that this was a good answer for the query; similarly,
another result is an email address. Our immediate goals are to improve search
efficiency and to scale to approximately 100 million pages. The biggest
problem facing users of web search engines today is the quality of the results
they get back.

