
Question

Research Article IT344

Project Nature:

Analytical (Individual Assignment)

What is a research article: A review article provides a summary of the current state of research on a particular topic. Ideally, the writer searches for everything relevant to the topic and then sorts it all into a comprehensible form. Review articles will teach you about:

· the main people working in a field

· recent major advances and discoveries

· significant gaps in the research

· current debates

· ideas of where research might go next

There are many benefits to reading research articles (Dunifon, 2005):

· Research articles are the best source of tested, evidence-based information.

· By going to the source of information, readers can draw their own conclusions about the quality of the research and its usefulness to their work.

· Readers can use the research to inform decisions about their programs, including decisions about program development, design, or discontinuation.

· Readers can incorporate the evidence into their practice or resource materials.

Project Description:

In this project, students are required to submit a research article on a topic related to Data Mining or Data Warehousing.

Submit a research draft on the topic selected. Read at least 5 papers relevant to the topic selected. The article should contain the following sections:

A. Introduction Section: You have to write an introduction of at least one page on the topic that you have selected, in your own words. The introduction should explain the topic in simple English. You can use models or diagrams to explain the topic that you have selected. This section should not be more than two pages long.

B. Literature section: In this section, you have to describe research papers about the topic in your own words. The literature review provides the reader with a summary of other research related to the topic. It also addresses questions that remain unanswered or require additional research. In general, this is also the section where the authors’ research question is introduced, and hypotheses or anticipated results are stated. The articles that you select have to be journal papers or conference proceedings; you cannot cite webpages as literature. At least 70% of the literature selected should have been published after 2011. All the articles should be properly referenced in this section.

Do not copy any part of an article into your review. If you want to use more than 3-4 of the author's words, then use quotation marks and add a reference to the article.

C. Methods: In this section, you have to introduce and explain the major applications of the topic that are currently in use. You can use models or diagrams to explain these applications. This section of the research article should outline the methodology the author(s) used in conducting the study. Including information on the methods used allows readers to determine whether the study used appropriate research methods for the question being investigated.

D. Results: State research findings in this section. The results are often displayed using tables, charts, graphs or figures along with a written explanation.

E. Conclusion: In this section, you have to write a one- to two-paragraph summary of your whole research in your own words.

F. References: In this section, you have to give proper references for the literature that you have presented in section B. The format that you will use for writing references is the IEEE format.

The document format that you have to follow for this assignment is the IEEE single-column format for conference papers. This format is available at the following link: www.ece.utah.edu/~ece1270/_IEEE_Template_1col_2sp.doc

Important things to remember

1. Use your own words. Do not copy/paste from the internet. The SafeAssign feature of Blackboard will identify text you have copied from the internet, and your marks will be deducted.

2. Do not copy/paste from any other group’s work. This will also be identified by the SafeAssign feature, and your assignment will be cancelled and marked zero.

3. Use diagrams/figures/models, etc., to convey your topic properly. This will eventually help in increasing your marks.

4. Do not forget to add a reference whenever you describe some research. Always follow the method of referencing given in the template.


Explanation / Answer

Abstract

The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data well, and little research has been carried out on the problem of clustering high-cardinality nominal variables to get better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. The schema generation process is a complex manual task and requires familiarity with both the domain and data warehousing; automated techniques are required to generate the warehouse schema and overcome these dependencies. To fulfill growing analytical needs and overcome the existing limitations, we propose a novel methodology in this paper that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.

Keywords: Automatic Schema, Clustering, Data Warehouse, Multi-dimensional Analysis

1. Introduction

The extensive use of computers and information technology has made large-scale data collection a routine task in a variety of fields. Continuously growing data repositories can contribute significantly to future decision making, provided appropriate knowledge discovery mechanisms are applied to extract the hidden but potentially useful information embedded in the data. One of the main mechanisms of knowledge discovery is the efficient analysis of data using modern analytical systems.

A tough barrier to the efficient analysis of data is the presence of mixed numeric and nominal variables in real-world data sets. Abundant algorithms and techniques have been proposed in the literature for the analysis of numeric data, but little research has been carried out to tackle the problem of mixed numeric and nominal data analysis. Traditional methodologies assume variables are numeric valued, but as application areas have grown from the scientific and engineering domains to the biological and social domains, one has to deal with features, such as country, color, shape, and type of disease, that are nominal valued. In addition, high-cardinality nominal variables with a large number of distinct values, such as product codes, country names, and model types, are not only difficult to analyze but also require effective visual exploration.

Visualization techniques are becoming increasingly important for the analysis and exploration of large multidimensional data sets. However, the results of many visualization techniques, such as parallel coordinates, are affected by the order in which attributes are displayed. Moreover, accurate spacing among the attribute values is necessary to recognize the semantic relations in the underlying data.

The major focus of this paper is the seamless integration of data mining and data warehousing. Data mining aims at the extraction of synthesized and previously unknown insights from large data sets. It can be viewed as an automated application of algorithms to detect patterns and extract knowledge from the data that is not obvious to the user. Data warehousing is recognized as a key technology for the exploitation of the massive amounts of data nowadays available electronically in many organizations. The two disciplines, data warehousing and data mining, are both mature in their own right, but surprisingly little research has been carried out on integrating these two strands of research. The key problem is that, for the integration to occur in a seamless manner, the data has to be modeled in a data warehouse schema. Data warehouse modeling is a complex task, which involves knowledge of the business processes of the domain of discourse, understanding of the system’s structural and behavioral conceptual model, and familiarity with data warehouse technologies. There is an obvious need to automate the schema generation process to overcome the schema modeling complexities and domain dependencies. Additionally, this automation is required not only to improve analytical power but also to increase the flexibility and adaptability of existing data mining and warehousing systems.

2. Literature Review

We review past work in four major themes that relate most closely to the research we undertake. These themes are: numeric and nominal data analysis; visualization of multidimensional data; automatic generation of data warehouse schemas; and integration of data mining methods into data warehousing design.

2.1. Numeric and nominal data analysis

Real-world datasets consist of a mix of numeric and nominal data. In particular, data sets with a large number of nominal variables, including some with a large number of distinct values, are becoming increasingly common. For the purpose of efficient analysis of mixed data sets, one line of work identified the problems associated with the traditional k-means algorithm, which is best suited to numeric data only. In order to perform analysis on mixed data, the authors proposed a new algorithm which uses a cost function and distance measure based on the co-occurrence of values. The proposed cost function alleviated the cost-effectiveness shortcoming of Huang’s cost function. The limitation of the proposed work is that the analysis relies on co-occurrences of data and on discretizing numeric values, which leads to loss of information.

For the same purpose, another study introduced a feature selection algorithm for mixed data containing both continuous and nominal features. The authors stressed that feature selection is a crucial step in pattern recognition. Furthermore, a new evaluation criterion was used to avoid feature type transformation through careful decomposition of the feature space. The limitation of the proposed algorithm is that it produced better results in experiments on artificial data than on real-world data. In addition, the algorithm was not compared in terms of computational cost; it is computationally expensive because, as a first step, it decomposes the feature space along the values of the nominal features and then combines these measures to produce an overall evaluation.

In the quest for mixed data analysis, a further study compared three different distance measure functions for efficient analysis of mixed variables. The authors identified a strong need to develop Mahalanobis-type distances for mixed variables, because the research done in this regard is either heuristic or makes use of nominal data only. The strength of their work is the comparison of measures for computing Mahalanobis-type distances between categorical and numeric dimensions. The limitation is that very small data sets with only a few records were used to perform the validation. Furthermore, for the nominal data they did not target variables with a large number of distinct values, often called high-cardinality nominal variables. In real-world data sets there exist a large number of such variables, and it is important to analyze them in order to identify the semantic relationships among the large number of distinct values present in each variable. Additionally, standard data visualization methods do not deal satisfactorily with high-cardinality variables and need to be enhanced significantly to handle them.
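To make the mixed-data clustering idea above concrete, here is a minimal sketch (not the cited authors' co-occurrence algorithm) of a k-prototypes-style clustering loop in Python: squared Euclidean distance on numeric columns plus a weighted simple-matching distance on nominal columns. The weight `gamma` and the toy records are illustrative assumptions.

```python
import numpy as np

def mixed_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """Record-to-prototype distance: squared Euclidean on numeric
    parts + gamma * number of mismatches on nominal parts."""
    return np.sum((x_num - c_num) ** 2) + gamma * np.sum(x_cat != c_cat)

def kprototypes(X_num, X_cat, k, gamma=1.0, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    idx = rng.choice(n, size=k, replace=False)
    proto_num, proto_cat = X_num[idx].copy(), X_cat[idx].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # Assignment step: nearest prototype under the mixed distance.
        for i in range(n):
            d = [mixed_distance(X_num[i], X_cat[i], proto_num[j], proto_cat[j], gamma)
                 for j in range(k)]
            labels[i] = int(np.argmin(d))
        # Update step: mean for numeric columns, mode for nominal columns.
        for j in range(k):
            members = labels == j
            if members.any():
                proto_num[j] = X_num[members].mean(axis=0)
                for col in range(X_cat.shape[1]):
                    vals, counts = np.unique(X_cat[members, col], return_counts=True)
                    proto_cat[j, col] = vals[np.argmax(counts)]
    return labels

# Toy mixed dataset: (age, income) numeric; (country, product) nominal.
X_num = np.array([[25, 30e3], [27, 32e3], [52, 90e3], [50, 88e3]])
X_cat = np.array([["PK", "A"], ["PK", "A"], ["US", "B"], ["US", "B"]])
print(kprototypes(X_num, X_cat, k=2))  # e.g. [0 0 1 1]
```

In practice the numeric columns would be standardized first so that `gamma` meaningfully balances the numeric and nominal contributions.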

2.2. Effective visualization of multi-dimensional data

For effective visualization of data, one study introduced similarity clustering of dimensions as an important technique for enhancing the results of a number of different multi-dimensional visualization techniques. The authors presented a number of similarity measures to determine similarities among dimensions, targeting the dimension arrangement problem, and proposed a heuristic solution based on an intelligent ant system for enhanced visualization. The major limitation of the proposed work is that it applies to only three types of visualization techniques, namely parallel coordinates, circle segments, and recursive patterns. In addition, the work only supports the arrangement (order) of dimensions and not the spacing among the values of each dimension. To extract useful information from high-cardinality nominal variables, both an effective ordering of dimensions and meaningful spacing among the values are required: meaningful spacing plays a vital role in the interpretation of visualization results and helps in the recognition of meaningful patterns in the underlying data values. Experiments were performed on data sets with a small number of dimensions; the work gives no indication of how high-dimensional data should be visualized, and visualization of higher-dimensional data was not attempted. More specifically, the parallel coordinates technique is better suited to datasets with a small number (at most about 10) of dimensions. High-dimensional datasets are difficult to visualize with this technique because the effectiveness of a parallel coordinates display decreases as the number of variables increases.
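As a minimal sketch of the dimension-arrangement idea (a greedy correlation heuristic, not the cited ant-system method), the Python snippet below orders the axes of a parallel coordinates plot so that similar dimensions end up adjacent. The dataset and column names are invented for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Illustrative data: four numeric dimensions plus a class label.
rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 0.9 + rng.normal(scale=0.3, size=100)   # similar to a
df["c"] = rng.normal(size=100)
df["d"] = df["c"] * -0.8 + rng.normal(scale=0.4, size=100)  # similar to c
df["label"] = np.where(df["a"] > 0, "hi", "lo")

def order_by_similarity(frame, cols):
    """Greedy ordering: start anywhere, repeatedly append the unused
    dimension most correlated (in absolute value) with the last one."""
    corr = frame[cols].corr().abs()
    order, remaining = [cols[0]], set(cols[1:])
    while remaining:
        nxt = max(remaining, key=lambda c: corr.loc[order[-1], c])
        order.append(nxt)
        remaining.remove(nxt)
    return order

cols = ["a", "c", "b", "d"]  # deliberately interleaved start order
ordered = order_by_similarity(df, cols)
parallel_coordinates(df[ordered + ["label"]], "label", alpha=0.4)
plt.title(f"Axis order: {ordered}")
plt.show()
```

With correlated axes placed side by side, line crossings between adjacent axes drop, which is what makes the patterns easier to read; note this addresses only ordering, not the value-spacing problem discussed above.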

2.3. Automatic generation of DW schema

One study identified the fact that most research focuses on the automatic derivation of database schemata from conceptual models but neglects the automatic derivation of OLAP metadata in a way that is integrated with the database schema. The authors emphasized that such integration is extremely important because it allows end-user tools to query the data warehouse accurately and reduces development time and cost. A model-transformation architecture was proposed to facilitate the automatic generation of the warehouse schema, and the work was implemented on an open-source development platform to automatically generate schemas from conceptual multidimensional models.

Likewise, another work suggested an Object-Process-based Data Warehouse Construction method (ODWC) for constructing data warehouse schemas. The method uses a stepwise rule-based algorithm to derive the schema from the source operational model. The proposed ODWC method overcomes the major limitations of manual schema construction and the lack of automated assistance for identifying facts and dimensions from conceptual models. However, ODWC has been applied to only one case study to date; there is a strong need to apply it to various case studies to strengthen its effectiveness and applicability.

Similarly, a further study presented a technique for obtaining, in a fairly automatic way, a data warehouse designed over a set of source operational databases. The proposed technique takes a list of database schemas in the form of an ER model and a dictionary of lexical synonymy properties, and generates a data warehouse as output. Again, the limitation of the technique is that it was evaluated on merely a single case study.
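To illustrate the flavor of such rule-based schema derivation (a simplified sketch, not any of the cited methods), the Python snippet below applies one common heuristic: numeric columns of a source table become measures in a fact table, and nominal columns become dimension tables. All table and column names are invented for the example.

```python
# A simplified rule-based star-schema generator (illustrative only).
# Heuristic: numeric columns -> measures; nominal columns -> dimensions.
SOURCE_TABLE = "sales"
COLUMNS = {               # column name -> inferred type
    "amount": "numeric",
    "quantity": "numeric",
    "country": "nominal",
    "product": "nominal",
}

def derive_star_schema(source, columns):
    measures = [c for c, t in columns.items() if t == "numeric"]
    dims = [c for c, t in columns.items() if t == "nominal"]
    ddl = []
    for d in dims:
        ddl.append(
            f"CREATE TABLE dim_{d} (\n"
            f"  {d}_key INTEGER PRIMARY KEY,\n"
            f"  {d}_value TEXT\n);"
        )
    fact_cols = [f"  {d}_key INTEGER REFERENCES dim_{d}({d}_key)" for d in dims]
    fact_cols += [f"  {m} REAL" for m in measures]
    ddl.append(f"CREATE TABLE fact_{source} (\n" + ",\n".join(fact_cols) + "\n);")
    return "\n\n".join(ddl)

print(derive_star_schema(SOURCE_TABLE, COLUMNS))
```

A real method would additionally infer hierarchies and keys from the conceptual model; the point here is only that simple typing rules already yield a usable fact/dimension split without manual modeling.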

2.4. Exploiting data mining techniques in DWs

An infrastructure for parallel multidimensional analysis and data mining was suggested in one study. The work argued that OLAP queries, which are ad hoc in nature and require fast response times, can be served by pre-aggregating calculations, and that data mining can use some of these pre-computed aggregates to obtain the probabilities needed for calculating support and confidence measures for association rules. In order to perform OLAP and data mining operations, another study presented algorithms and techniques for constructing data cubes and showed that these cubes can be used for data mining with the Attribute Focusing technique. Since data cubes already hold aggregated values over combinations of attributes, the computations of attribute focusing are greatly facilitated by them. The authors claimed that dimensional hierarchies can be utilized to provide multiple-level data mining; this is useful for mining at multiple concept levels, and interesting information can potentially be obtained at different levels.
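As a minimal sketch of how pre-aggregated cube counts feed association-rule measures (an illustration, not the cited infrastructure), the Python snippet below builds a small count cube with pandas and reads support and confidence for a rule directly off the aggregates. The transaction data and the rule are invented for the example.

```python
import pandas as pd

# Illustrative transactions: each row is one purchase event.
df = pd.DataFrame({
    "country": ["PK", "PK", "US", "US", "US", "PK"],
    "product": ["A", "A", "B", "A", "B", "B"],
})
n = len(df)

# "Cube": pre-aggregated counts over the attribute combination,
# with margins acting as the lower-level group-by aggregates.
cube = pd.crosstab(df["country"], df["product"], margins=True)
print(cube)

# Rule: country=PK -> product=A, computed purely from the cube,
# with no second pass over the raw transactions.
count_pk_a = cube.loc["PK", "A"]    # co-occurrence count
count_pk   = cube.loc["PK", "All"]  # marginal count
support    = count_pk_a / n
confidence = count_pk_a / count_pk
print(f"support={support:.2f} confidence={confidence:.2f}")
```

This is exactly the economy the cited works exploit: once the cube is materialized, every rule over those attributes costs only a handful of lookups rather than a scan of the base data.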

