Question
You are asked to design and implement an information retrieval / search engine system for enterprise search over at least 10,000,000 documents that reside on an internal network (intranet); the collection is expected to grow.
– Explain to your clients what an enterprise search engine is by describing its basic components and what each does. What differences are there between enterprise search and a web search engine such as Google?
– What are the important characteristics you must consider in the design including the size of storage needed? What is the storage for a mixture of documents and images?
– Define the measures that will help you evaluate its performance, including how you can design for scaling.
Explanation / Answer
A)
1. Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.
2. "Enterprise search" also describes the software used to search for information within an enterprise (though the search function and its results may still be public).
3. Enterprise search can be contrasted with web search, which applies search technology to documents on the open web, and desktop search, which applies search technology to the content on a single computer.
4. Enterprise search systems index data and documents from a variety of sources, such as file systems, intranets, document management systems, e-mail, and databases. Many enterprise search systems integrate structured and unstructured data in their collections.
5. Enterprise search systems also use access controls to enforce a security policy on their users.
6. Enterprise search can be seen as a type of vertical search of an enterprise.
7. Components of an enterprise search system: In an enterprise search system, content goes through various phases from source repository to search results:
-Content awareness
Content awareness (or "content collection") usually follows either a push or a pull model. In the push model, a source system is integrated with the search engine so that it connects to the engine and pushes new content directly to its APIs. This model is used when real-time indexing is important. In the pull model, the software gathers content from sources using a connector such as a web crawler or a database connector. The connector typically polls the source at regular intervals to look for new, updated, or deleted content.
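The pull model can be illustrated with a minimal sketch. The `PullConnector` class and the idea of representing a source as a mapping from document ID to a version value (for example a modification timestamp or content hash) are assumptions made for illustration, not part of any particular product. Each poll compares the current snapshot with the previous one to find new, updated, and deleted content.

```python
from dataclasses import dataclass, field


@dataclass
class PullConnector:
    """Hypothetical pull-model connector that diffs snapshots between polls."""
    seen: dict = field(default_factory=dict)  # doc_id -> version at last poll

    def poll(self, source: dict) -> dict:
        """Compare the source snapshot (doc_id -> version) with the last poll."""
        added = [d for d in source if d not in self.seen]
        updated = [d for d in source
                   if d in self.seen and self.seen[d] != source[d]]
        deleted = [d for d in self.seen if d not in source]
        self.seen = dict(source)  # remember this snapshot for the next poll
        return {"added": sorted(added),
                "updated": sorted(updated),
                "deleted": sorted(deleted)}


connector = PullConnector()
first = connector.poll({"a.txt": 1, "b.txt": 1})
second = connector.poll({"a.txt": 2, "c.txt": 1})
print(first)
print(second)
```

In a real deployment the connector would run on a schedule and forward each change set to the content processing phase; a push integration would instead call the engine's API as soon as a document changes.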
-Content processing and analysis
Content from different sources may have many different formats or document types, such as XML, HTML, Office document formats, or plain text. The content processing phase converts the incoming documents to plain text using document filters. It is also often necessary to normalize content in various ways to improve recall or precision. These may include stemming, lemmatization, synonym expansion, entity extraction, and part-of-speech tagging.
As part of processing and analysis, tokenization is applied to split the content into tokens, which are the basic matching units. It is also common to normalize tokens to lower case to provide case-insensitive search, as well as to normalize accents to provide better recall.
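A minimal sketch of tokenization with lower-casing and accent normalization follows. It uses only the Python standard library; the function names `normalize` and `tokenize` are illustrative, and splitting on `\w+` is a simplifying assumption (production analyzers handle punctuation, languages, and stemming far more carefully).

```python
import re
import unicodedata


def normalize(text):
    """Lower-case the text and strip accents via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks (the accent part of decomposed characters).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into word tokens, the basic matching units."""
    return re.findall(r"\w+", normalize(text))


tokens = tokenize("Résumé of the Café meeting, 2024")
print(tokens)
```

With this normalization, a query for "resume" or "cafe" would match documents that contain the accented or capitalized forms.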
-Indexing
The resulting text is stored in an index, which is optimized for quick lookups without storing the full text of the document. The index may contain the dictionary of all unique words in the corpus as well as information about ranking and term frequency.
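One common structure for such an index is an inverted index, which maps each unique term to the documents that contain it along with the term frequency. The sketch below is a simplified illustration under stated assumptions: `build_index` is a hypothetical helper, documents are tokenized by whitespace only, and everything is held in memory.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "d1": "quarterly sales report",
    "d2": "sales forecast sales plan",
}
index = build_index(docs)
dictionary = sorted(index)  # all unique words in the corpus
print(dictionary)
print(index["sales"])
```

Note that the index stores only terms and counts, not the original text, which is what keeps lookups fast and the index smaller than the corpus.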
-Query Processing
Using a web page, the user issues a query to the system. The query consists of any terms the user enters as well as navigational actions such as faceting and paging information.
-Matching
The processed query is then compared to the stored index, and the search system returns results (or "hits") referencing source documents that match. Some systems are able to present the document as it was indexed.
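Matching can be sketched as a lookup against an inverted index. The example below assumes a small hand-written index of the form term -> {doc_id: term frequency}, requires every query term to be present (AND semantics), and ranks hits by summed term frequency. Both the `search` function and this scoring rule are illustrative assumptions; real engines typically use more refined ranking.

```python
def search(index, query):
    """Return doc IDs containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Keep only documents that appear in the postings of every term.
    matching = set(postings[0]).intersection(*postings[1:])
    scores = {doc: sum(p[doc] for p in postings) for doc in matching}
    # Sort by descending score, then by doc ID for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


index = {
    "sales": {"d1": 1, "d2": 2},
    "report": {"d1": 1, "d3": 1},
    "forecast": {"d2": 1},
}
print(search(index, "sales"))
print(search(index, "sales report"))
print(search(index, "sales budget"))
```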
8. Differences from web search
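Beyond the difference in the kinds of materials being indexed, enterprise search systems also typically include functionality that is not associated with the mainstream web search engines. As noted above, this includes connectors for indexing content from many kinds of repositories (item 4), access controls that restrict results to what each user is permitted to see (item 5), and navigational features such as faceting.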
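The access controls mentioned in item 5 can be sketched as a filter applied to the hits before they are returned. The names `filter_by_access`, `acl`, and `user_groups` are hypothetical, and the deny-by-default rule for documents without an access list is an assumption of this sketch rather than a requirement of any product.

```python
def filter_by_access(hits, acl, user_groups):
    """Keep only hits whose access list shares a group with the user."""
    # Documents with no ACL entry are hidden (deny by default).
    return [doc for doc in hits if acl.get(doc, set()) & user_groups]


acl = {
    "d1": {"staff"},
    "d2": {"finance"},
    "d3": {"staff", "finance"},
}
visible = filter_by_access(["d1", "d2", "d3", "d4"], acl, {"staff"})
print(visible)
```

A public web search engine generally has no equivalent step, since everything it indexes is visible to every user.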
B)
1. In computer architecture, the memory hierarchy is a concept used when discussing performance issues in architectural design, algorithm performance predictions, and lower-level programming constructs such as those involving locality of reference.
2. The memory hierarchy in computer storage distinguishes each level in the hierarchy by response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlling technologies.
3. The many trade-offs in designing for high performance will include the structure of the memory hierarchy, i.e. the size and technology of each component. So the various components can be viewed as forming a hierarchy of memories (m_1, m_2, ..., m_n) in which each member m_i is in a sense subordinate to the next highest member m_(i+1) of the hierarchy.
4. To limit waiting by higher levels, a lower level will respond by filling a buffer and then signaling to activate the transfer.
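There are four major storage levels:
-Internal: processor registers and cache
-Main: the system RAM and controller cards
-Online mass storage: secondary storage
-Offline bulk storage: tertiary and offline storage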
This is a general memory hierarchy structuring. Many other structures are useful. For example, a paging algorithm may be considered as a level for virtual memory when designing a computer architecture, and one can include a level of nearline storage between online and offline storage.
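For the storage-size part of the question, a back-of-the-envelope estimate can be sketched as follows. Only the figure of 10,000,000 items comes from the question; the 20% image share, the average sizes of 100 KB per document and 500 KB per image, and the index overhead of 30% of the document text are all placeholder assumptions that would need to be replaced with measurements from the actual collection.

```python
def estimate_storage_bytes(n_items, image_percent, avg_doc_bytes,
                           avg_image_bytes, index_percent):
    """Rough raw-plus-index storage estimate using integer arithmetic."""
    n_images = n_items * image_percent // 100
    n_docs = n_items - n_images
    raw = n_docs * avg_doc_bytes + n_images * avg_image_bytes
    # Assume only text documents contribute to the search index.
    index = n_docs * avg_doc_bytes * index_percent // 100
    return raw + index


total = estimate_storage_bytes(
    n_items=10_000_000,       # from the question
    image_percent=20,         # assumption
    avg_doc_bytes=100_000,    # assumption: 100 KB per document
    avg_image_bytes=500_000,  # assumption: 500 KB per image
    index_percent=30,         # assumption: index is 30% of text size
)
print(total / 10**12, "TB")
```

Because the collection is expected to grow, the same function can be rerun with a larger `n_items` to size capacity ahead of time, and replication or backups would multiply the result further.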
C)
1. Improvement in individual, group, or organizational performance cannot occur unless there is some way of getting performance feedback. Feedback is having the outcomes of work communicated to the employee, work group, or company.
2. For an individual employee, performance measures create a link between their own behavior and the organization's goals. For the organization or its work units, performance measurement is the link between decisions and organizational goals.
3. It has been said that before you can improve something, you have to be able to measure it, which implies that what you want to improve can somehow be quantified.
4. Additionally, it has also been said that improvement in performance can result just from measuring it. Whether or not this is true, measurement is the first step in improvement. But while measuring is the process of quantification, its effect is to stimulate positive action.
5. Managers should be aware that almost all measures have negative consequences if they are used incorrectly or in the wrong situation.
6. Managers have to study the environmental conditions and analyze these potential negative consequences before adopting performance measures.
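For a search system specifically, the precision and recall mentioned in part A are two widely used quantitative measures: precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. The sketch below assumes that a set of relevant documents has been judged by hand for a test query; the function name and sample data are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query's result list."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgments: the engine returned four hits, three docs are relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant={"d1", "d3", "d5"},
)
print(precision, recall)
```

Tracking these values over a fixed set of test queries, alongside operational figures such as query latency and indexing throughput, gives the kind of feedback described above as the collection scales.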