CS614 – Subjective with Reference
Fall Spring 2012
Q#2. Difference between one to one and one to many transformation (2)
One-to-One Transformation
- A simple scalar transformation is a one-to-one mapping from one set of values to another set of values.
- A straightforward rule or formula is usually sufficient to ensure that the transformation is one-to-one.
- Transformation tools provide a design environment for creating such data transformation applications.
- The transformation functions may be simple formulas, such as polynomials.
One-to-Many Transformation
- A one-to-many transformation is more complex than a scalar transformation.
- A single data element from the source system results in several columns in the DW.
- Code generation can also create transformations in easy-to-maintain computer languages such as Java or XSLT.
- In general, a data transformation converts data from the source data format into the destination format.
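A minimal Python sketch (with hypothetical field names, not taken from the handouts) contrasting the two kinds of transformation:

# One-to-one: a single source value maps to a single target value,
# e.g. recoding a source gender code into the warehouse standard.
GENDER_MAP = {"M": "Male", "F": "Female", "U": "Unknown"}

def one_to_one(source_code: str) -> str:
    return GENDER_MAP.get(source_code, "Unknown")

# One-to-many: a single source field yields several warehouse columns,
# e.g. splitting a combined "city, state" field into two columns.
def one_to_many(city_state: str) -> dict:
    city, _, state = city_state.partition(",")
    return {"city": city.strip(), "state": state.strip()}

print(one_to_one("F"))                 # Female
print(one_to_many("Lahore, Punjab"))   # {'city': 'Lahore', 'state': 'Punjab'}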
Q#3. Describe the reason for summarization during data transformation (3)
It may not be feasible, or necessary, to keep data at the lowest level of detail in the data warehouse. For example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not be needed. Storing sales by product by store by day in the data warehouse may be quite adequate. So, in this case, the data transformation function includes summarization of daily sales by product and by store.
Ref: Handouts Page No. 136
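A minimal Python sketch of the summarization step described above: checkout-level transactions are rolled up to one row per (day, store, product) before loading. The field names and figures are illustrative, not from the handouts.

from collections import defaultdict

transactions = [
    {"day": "2012-01-05", "store": "S1", "product": "Milk",  "amount": 120.0},
    {"day": "2012-01-05", "store": "S1", "product": "Milk",  "amount": 80.0},
    {"day": "2012-01-05", "store": "S2", "product": "Bread", "amount": 40.0},
]

# Roll up transaction-level sales to daily sales by product and by store.
daily_sales = defaultdict(float)
for t in transactions:
    daily_sales[(t["day"], t["store"], t["product"])] += t["amount"]

for key, total in daily_sales.items():
    print(key, total)   # e.g. ('2012-01-05', 'S1', 'Milk') 200.0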
Q#4. How do clustering and association rules work? (3)
Clustering: Identify outlier records using clustering based on Euclidean (or other)
distance. Existing clustering algorithms provide little support for identifying outliers.
However, in some cases clustering the entire record space can reveal outliers that are not
identified by field-level inspection. The main drawback of this method is
computational time. The clustering algorithms have high computational complexity. For
large record spaces and a large number of records, the run time of the clustering algorithms
is prohibitive.
Association rules: Association rules with high confidence and support define a different
kind of pattern. As before, records that do not follow these rules are considered outliers.
The power of association rules is that they can deal with data of different types. However,
Boolean association rules do not provide enough quantitative and qualitative information.
Ref: Handouts Page No. 146
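An illustrative Python sketch of the distance-based idea behind the clustering approach: records far from every cluster centre are flagged as possible outliers. The centroids and threshold are assumed values for the example, not taken from the handouts.

from math import dist

centroids = [(10.0, 10.0), (50.0, 50.0)]    # assumed cluster centres
records   = [(11, 9), (49, 52), (95, 5)]    # the last record is far from both centres
THRESHOLD = 15.0                            # assumed distance cut-off

for r in records:
    nearest = min(dist(r, c) for c in centroids)   # Euclidean distance to nearest centre
    if nearest > THRESHOLD:
        print("possible outlier:", r)              # flags (95, 5)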
Q#5. The splitting of a single field transformation is used to store individual components of name and address in separate fields of the data warehouse. What is the main reason for doing this? (5)
Q#6. Explain additive and non-additive facts with examples (5).
Additive facts are those facts which give the correct result by an addition operation.
Examples of such facts are the number of items sold and the sales amount.
Non-additive facts can also be added, but the addition gives incorrect results.
Examples of non-additive facts are averages, discounts, ratios, etc.
Ref: Handouts Page No. 104
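A small worked example in Python: the additive fact (sales amount) sums correctly across rows, while averaging a stored discount percentage gives a wrong answer; the ratio has to be recomputed from additive components. The figures are made up for illustration.

rows = [
    {"sales": 100.0, "discount_pct": 10.0},
    {"sales": 300.0, "discount_pct": 20.0},
]

total_sales = sum(r["sales"] for r in rows)              # 400.0 -- additive, correct
naive_pct = sum(r["discount_pct"] for r in rows) / 2     # 15.0  -- non-additive, wrong
# Recompute the ratio from additive components instead:
total_discount = sum(r["sales"] * r["discount_pct"] / 100 for r in rows)   # 70.0
true_pct = 100 * total_discount / total_sales            # 17.5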
Q3: Two real-life examples of clustering (5 marks)
These are real-life examples of clustering:
Marketing: Discovering distinct groups in customer databases, such as customers
who make a lot of long-distance calls and don't have a job. Who are they? Students.
Marketers use this knowledge to develop targeted marketing programs.
Insurance: Identifying groups of crop insurance policy holders with a high average
claim rate. Farmers crash crops, when it is "profitable".
Land use: Identification of areas of similar land use in a GIS database.
Seismic studies: Identifying probable areas for oil/gas exploration based on
seismic data.
Ref: Handouts Page No. 264
Q4: Purposes of data profiling (3 marks)
Data profiling is a powerful method to have an idea about the quality of data. While profiling data we need to run queries to identify:
· Inconsistencies in date formats
· Invalidities
· Missing values of dates
· Violations in business rules
Ref: Handouts Page No. 477
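A rough Python sketch of the profiling checks listed above, run over one extracted column of date strings. The column values and the accepted formats are assumed for illustration.

from datetime import datetime

dates = ["2012-01-05", "05/01/2012", None, "2012-13-40"]
missing, invalid, formats_seen = 0, 0, set()

for d in dates:
    if d is None:                      # missing value
        missing += 1
        continue
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            datetime.strptime(d, fmt)
            formats_seen.add(fmt)      # more than one format seen = inconsistency
            break
        except ValueError:
            pass
    else:
        invalid += 1                   # value matched no accepted format

print("missing:", missing, "invalid:", invalid, "formats:", formats_seen)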
Q7: Explain data parallelism with an example (3 marks)
Data parallelism is the simplest form of parallelization. The idea is parallel
execution of a single data operation across multiple partitions of data.
Example
Suppose we want to select the number of accounts where the balance is greater than
$5,000 and the open date is after the first of June 2000. The accounts table is
partitioned, so we send the query to each query server, each query server runs the
query against its particular partition, and the partial results are then combined.
Ref: Handouts Page No. 212
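A sketch of data parallelism in Python: the same filter-and-count runs against each partition in a separate worker process and the partial counts are added. The partition contents and the use of multiprocessing are assumptions for illustration; a real system would send the query to partition-aware query servers.

from multiprocessing import Pool
from datetime import date

def count_accounts(partition):
    # The same "query" executed against one partition of the accounts table.
    return sum(1 for acct in partition
               if acct["balance"] > 5000 and acct["opened"] > date(2000, 6, 1))

partitions = [
    [{"balance": 7000, "opened": date(2001, 3, 1)}],
    [{"balance": 4000, "opened": date(2002, 5, 9)},
     {"balance": 9000, "opened": date(2000, 7, 1)}],
]

if __name__ == "__main__":
    with Pool(len(partitions)) as pool:
        print(sum(pool.map(count_accounts, partitions)))   # 2 qualifying accounts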
Q6: What is meant by the classification process? How is the accuracy of classification measured? (3 marks)
In the classification process, the data set is divided into two parts: one is called the test set and the other is called the training set. We pick the training set and a model is constructed based on known facts, historical data and class properties, as we already know the number of classes. After building the classification model, every record of the test set is posed to the classification model, which decides the class of the input record. Accuracy is then measured as the proportion of test records whose predicted class matches their actual class.
Ref: Handouts Page No. 278
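A tiny Python sketch of the accuracy measurement: a model is "trained" on the training set (here just a majority-class rule, purely to keep the example short) and then scored on the held-out test set. The data and labels are made up.

training = [("high", "yes"), ("low", "no"), ("high", "yes")]
test     = [("high", "yes"), ("low", "yes")]

# "Train": pick the most frequent class seen in the training set.
labels = [label for _, label in training]
majority_class = max(set(labels), key=labels.count)

# Score: fraction of test records whose predicted class matches the actual class.
correct = sum(1 for _, label in test if majority_class == label)
accuracy = correct / len(test)
print(accuracy)   # 1.0 here, since both test records carry the majority class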
What are timestamps? (3 marks)
A timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns.
Ref: Handouts Page No. 150
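A minimal Python sketch of timestamp-based change detection: only rows modified after the previous extract's high-water mark are picked up. The column names and dates are illustrative.

from datetime import datetime

last_extract = datetime(2012, 1, 1)   # high-water mark from the previous ETL run

rows = [
    {"id": 1, "last_modified": datetime(2011, 12, 30)},
    {"id": 2, "last_modified": datetime(2012, 1, 3)},
]

changed = [r for r in rows if r["last_modified"] > last_extract]
print([r["id"] for r in changed])     # [2] -- only the newly modified row is extracted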
How does aggregate awareness help the users?
The existence of aggregate tables alone does not help user queries; users must actually access the summary tables, ideally automatically, with awareness built into the BI tool or the RDBMS to force the query to be processed in the most efficient way. In recent years, BI vendors have debated whether aggregate awareness belongs in the BI tool or in the RDBMS. Business Objects is the only BI tool that has built-in functionality to take advantage of RDBMS aggregate tables in a way that is seamless to users.
How is a cube created in ROLAP?
In ROLAP, a cube is a logical entity containing values of a certain fact at a certain aggregation level at an intersection of a combination of dimensions.
Ref: Handouts Page No. 88
Differentiate between MOLAP and ROLAP implementations.
MOLAP
· Implemented with a multi-dimensional data structure.
· It physically builds "cubes" for direct access.
· ANSI SQL is not supported.
· Data is stored in the proprietary file format of a multi-dimensional database (MDD).
ROLAP
· Relational OLAP provides access to information
· ANSI SQL is supported.
· data is stored in a relational database (e.g. a star schema)
· The fact table can be visualized as an "un-rolled" cube.
Ref: Handouts Page No. 88
Factors behind poor data quality
Poor quality data is a pandemic problem that needs addressing. For example, the
inability to match payroll records to the official employment records can cost millions in payroll overpayments to retirees, personnel on leave-without-pay status, and "ghost" employees. The inability to correlate purchase orders to invoices may be a major problem in unmatched disbursements. Resultant costs, such as payroll overpayments and unmatched disbursements, may be significant enough to warrant extensive changes in processes, systems, policy and procedure, and information system data designs.
Ref: Handouts Page No. 184
Aggregates or hardware: which is better for enhancing DWH performance?
An aggregate is information stored in a data warehouse in summarized form. Aggregations are used to improve the performance of business intelligence tools - When queries run faster, they take up less processing time and the users get their information back more quickly.
· To exclude one or more dimensions when summarizing a fact table.
· To have one or more dimensions replaced by rolled up versions of themselves
· The goal of an aggregate program in a large data warehouse must be more than just improving performance.
Ref: http://www.dama-nj.org/presentations/SyncSort%20DAMA_NJ_Apr24-rpv6.pdf
CDC: timestamps, triggers and partitioning; which is best? Give the reason.
Some source systems might use range partitioning, such that the source tables are
partitioned along a date key, which allows for easy identification of new data.
Reason: if you are extracting from an orders table, and the orders table is partitioned by
week, then it is easy to identify the current week's data.
Ref: Handouts Page No. 151
Difference between low granularity and high granularity
Low granularity (highly aggregated data)
- The higher the level of aggregation of the fact table, the fewer the dimensions that can attach to the fact records.
- The converse is also true.
High granularity (detailed data)
- The more granular the data, the more dimensions make sense.
- The most granular data in any organization is the most dimensional.
Ref: Handouts Page No. 118
Briefly describe snowflake schema.
Snowflake Schema: Sometimes a pure star schema might suffer performance problems.
This can occur when a de-normalized dimension table becomes very large and penalizes
the star join operation. Conversely, sometimes a small outer-level dimension table does
not incur a significant join cost because it can be permanently stored in a memory buffer.
Furthermore, because a star structure exists at the center of a snowflake, an efficient star
join can be used to satisfy part of a query. Finally, some queries will not access data from
outer-level dimension tables. These queries effectively execute against a star schema that
contains smaller dimension tables. Therefore, under some circumstances, a snowflake
schema is more efficient than a star schema.
Ref: Handouts Page No. 105
Why are both aggregation and summarization required?
Although summarization and aggregation are sometimes used interchangeably, they are typically used for the following reasons:
· They are required when the lowest level of detail stored in the data warehouse is at a higher level than the detail arriving from the source. This situation arises when data warehouse queries do not require the lowest level of detail, or sometimes when sufficient disk space is not available to store data at the lowest level of detail for the time frame required by the data warehouse.
· They can be used to populate data marts from the data warehouse, where the mart does not require the same level of detail as is stored in the warehouse.
· They can be used to roll up detail values when the detail is removed from the warehouse because it is no longer needed.
Under what condition do smart tools work properly to construct a less detailed aggregate
from a more detailed aggregate?
Smart tools will allow less detailed aggregates to be constructed from other aggregates (full aggregate awareness) at run-time, so that we do not go all the way to the detail for every aggregation. However, for this to work, the metrics must be additive (e.g., no ratios, averages, etc.). More detailed pre-aggregates are larger, but they can also be used to build less detailed aggregates on the fly.
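A small Python sketch of the point above: because the sales amount is additive, a monthly aggregate can be built from an existing daily aggregate without revisiting the transaction detail. The keys and figures are assumed.

from collections import defaultdict

daily_sales = {                       # pre-built (day, product) aggregate
    ("2012-01-05", "Milk"): 200.0,
    ("2012-01-20", "Milk"): 150.0,
    ("2012-02-02", "Milk"): 90.0,
}

monthly_sales = defaultdict(float)
for (day, product), amount in daily_sales.items():
    monthly_sales[(day[:7], product)] += amount    # roll days up to months

print(dict(monthly_sales))   # {('2012-01', 'Milk'): 350.0, ('2012-02', 'Milk'): 90.0}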
What is web scraping? Give some of its uses.
Web scraping is the process of applying screen scraping techniques to the web. There
are several web scraping products in the market that target business users who want to
creatively use the data, not write complex scripts. Some of the uses of scraping are listed below (a small scraping sketch follows the list):
- Building contact lists
- Extracting product catalogs
- Aggregating real-estate info
- Automating search Ad listings
- Clipping news articles etc.
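A minimal Python sketch of the idea, using only the standard library: product names are pulled out of an HTML fragment. The HTML snippet and its tag structure are assumed; real pages usually need a sturdier parser and a fetching step.

from html.parser import HTMLParser

class ProductScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True      # the next text node is a product name

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())
            self.in_product = False

page = '<ul><li class="product">Milk</li><li class="product">Bread</li></ul>'
scraper = ProductScraper()
scraper.feed(page)
print(scraper.products)   # ['Milk', 'Bread']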
After completing the transformation task, the data loading activity is started. How many
types of data loading strategies are there, and when is each type of strategy adopted? Explain.
The choice of a data loading strategy depends on the following factors:
- Data freshness
  - Very fresh data implies low update efficiency.
  - Historical data allows high update efficiency.
  - There are always trade-offs in the light of the goals.
- System performance
  - Availability of staging table space.
  - Impact on the query workload.
- Data volatility
  - Ratio of new to historical data.
  - A high percentage of data change (favours batch update).
How is data extracted from legacy systems?
Taken jointly, the extract programs of the naturally evolving systems form a spider web, also called the "legacy systems" architecture.
HOLAP features?
HOLAP: OLAP implemented as hybrid of MOLAP and ROLAP.
HOLAP provides a combination of relational database access and cube data structures within a single framework. The goal is to get the best of both MOLAP and ROLAP: scalability (via relational structures) and high performance (via pre-built cubes).
Some uses of web scraping? Page No. 146
Building contact lists, extracting product catalogs, aggregating real-estate info, automating search Ad listings, clipping news articles, etc.
Simple many-to-many element transformations?
The most complex transformations are the many-to-many element transformations; good examples are householding and individualization. This is achieved by using candidate keys and fuzzy matching to determine which records refer to the same individual, which individuals go in the same household, and so on. This is a very complex transformation.
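A rough Python sketch of the fuzzy-matching idea behind householding/individualization: two name strings are treated as the same individual when their similarity ratio passes a threshold. The threshold and the sample names are illustrative only.

from difflib import SequenceMatcher

def same_person(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return score >= threshold

print(same_person("Muhammad Ahmed", "Mohammad Ahmed"))   # True  (spelling variation)
print(same_person("Muhammad Ahmed", "Sara Khan"))        # False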
Differentiate between ER and DM? Page No. 102
ER
Constituted to optimize OLTP performance. Models the micro relationships among data elements. There is wide variability in the structure of ER models. Very vulnerable to changes in the user's querying habits, because such schemas are asymmetrical.
DM
Constituted to optimize DSS query performance. Models the macro relationships among data elements with an overall deterministic strategy. All dimensions serve as equal entry points to the fact table. Changes in user querying habits can be catered for by automatic SQL generators.
Explain Orr's Laws of Data Quality. Page No. 181
Law 1: Data that is not used cannot be correct.
Law 2: Data quality is a function of its use, not its collection.
Law 3: Data will be no better than its most stringent use.
Law 4: Data quality problems increase with the age of the system.
Law 5: The less likely something is to occur, the more traumatic it will be when it happens.
What is the Calculated and Derived Values transformation?
Calculated and Derived Values:
Suppose we want to keep the profit margin along with the sales and cost amounts in the data warehouse tables (briefly touched on in denormalization). The extracted data from the sales system contains sales amounts, sales units, and operating cost estimates by product. You will have to calculate the total cost and the profit margin before the data can be stored in the data warehouse. Age and GPA are examples of derived fields.
Ref: Handouts Page No. 153
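A minimal Python sketch of this derivation: the profit margin column is computed from the extracted sales and cost figures before the rows are loaded. The field names and numbers are illustrative.

rows = [
    {"product": "Milk",  "sales_amount": 500.0, "operating_cost": 350.0},
    {"product": "Bread", "sales_amount": 200.0, "operating_cost": 160.0},
]

for r in rows:
    # Derived value added during transformation, not present in the source.
    r["profit_margin"] = (r["sales_amount"] - r["operating_cost"]) / r["sales_amount"]

print(rows[0]["profit_margin"])   # 0.3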
Why bother about data duplication? Explain using an example.
Data duplication can result in costly errors, such as:
- False frequency distributions.
- Incorrect aggregates due to double counting.
- Difficulty for credit card companies in catching fabricated identities.
Without accurate identification of duplicated information, frequency distributions and various other aggregates will produce false or misleading statistics, leading to perhaps untrustworthy new knowledge and bad decisions. Thus, this has become an increasingly important and complex problem for many organizations that are in the process of establishing a data warehouse or updating the one already in existence.
Ref: Handouts Page No. 165,166
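A small Python example of the double-counting problem: the same customer extracted twice with a minor spelling variation inflates a simple count until the records are de-duplicated. The matching key used here (trim and lower-case) is deliberately naive.

orders = [
    {"customer": "Ali Raza",  "amount": 100.0},
    {"customer": "ali raza ", "amount": 100.0},   # duplicate of the same person
    {"customer": "Sara Khan", "amount": 50.0},
]

naive_count   = len({o["customer"] for o in orders})                    # 3 -- wrong
deduped_count = len({o["customer"].strip().lower() for o in orders})    # 2 -- correct
print(naive_count, deduped_count)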
Briefly explain Summarization as a basic data transformation task.
Basic tasks
- Selection
- Splitting/Joining
- Conversion
- Summarization
- Enrichment
Summarization: Sometimes you may find that it is not feasible to keep data at the lowest
level of detail in your data warehouse. It may be that none of your users ever need data at
the lowest granularity for analysis or querying. For example, for a grocery chain, sales
data at the lowest level of detail for every transaction at the checkout may not be needed.
Storing sales by product by store by day in the data warehouse may be quite adequate. So,
in this case, the data transformation function includes summarization of daily sales by
product and by store.
Ref: Handouts Page No. 135,136
Why ER Modeling has been so successful?
The ER modeling technique is a discipline used to highlight the microscopic relationships
among data elements or entities. The pinnacle of achievement of ER modeling is to remove all redundancy from the data. This is enormously beneficial for the OLTP systems, because transactions are made very simple and deterministic i.e. no surprises in the overall query execution time i.e. it must finish (say) under 5 seconds. The transaction for updating customer's address gets reduced to a single record lookup in a customer address master table. This lookup is controlled by a customer address key, which defines uniqueness of the customer address record i.e. PK. This lookup is implemented using an index, hence is extremely fast. It can be stated without hesitation, that the ER techniques have contributed to the success of the OLTP systems.
- Coupled with normalization drives out all the redundancy from the database.
- Change (or add or delete) the data at just one point.
- Can be used with indexing for very fast access.
- Resulted in success of OLTP systems.
Ref: Handouts Page No. 99
Define ETL. List three typical MIS/ERP systems that are found while doing ETL?
ETL stands for Extract, Transform, Load. ETL refers to the methods involved in accessing and manipulating source data and loading it into a data warehouse. The order in which these processes are performed varies. Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.
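A minimal end-to-end ETL sketch in Python under assumed inputs: rows are extracted from a source list, transformed (cleaned and typed), and loaded into a target list standing in for the warehouse table.

def extract():
    # Stand-in for reading from a source system.
    return [{"name": " ali ", "sales": "100"}, {"name": "SARA", "sales": "250"}]

def transform(rows):
    # Clean names and convert sales to a numeric type.
    return [{"name": r["name"].strip().title(), "sales": float(r["sales"])}
            for r in rows]

def load(rows, warehouse):
    # Stand-in for inserting into the warehouse.
    warehouse.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table)   # [{'name': 'Ali', 'sales': 100.0}, {'name': 'Sara', 'sales': 250.0}]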
Measures of the three quality dimensions: this view regards system quality as a functional
feature of the system itself and finds applying the "ease of use" factor to describe system quality somewhat problematic; greater importance is therefore attached to criteria such as the system's response time and accuracy when measuring system quality.
Measures of the two use dimensions: researchers often adopt TAM (the Technology Acceptance Model) to explain and forecast users' behavior and how new technology influences people's lives. IS use is directly impacted
by behavioral intention (BI), a weighted function of attitude towards usage and perceived usefulness. Perceived usefulness and perceived ease of use determine attitudes toward usage.
General perceptual measures of net benefits of ERP: the primary benefits expected to result from ERP are closely related to the level of integration that is promoted across functions in an enterprise. Expectations for improved business performance after adoption may result from both operational and strategic benefits. Some of the most significant intangible benefits include internal integration, improved information and processes, and improved customer service.
Ref: Handouts Page No. 129
http://paul-hadrien.info/backup/LSE/IS%20470/litterature%20review/success%20in%20Taiwan.pdf
What is meant by HOLAP? Why is it used?
The hybrid OLAP (HOLAP) solution is a mix of the MOLAP and relational (ROLAP)
architectures that supports queries against summary and transaction data in an integrated fashion. When comparing the use of MOLAP, HOLAP and ROLAP, the type of storage medium impacts cube processing time, cube storage and cube browsing speed, and HOLAP is used to balance these factors.
Ref: Handouts Page No. 96
http://www.geekinterview.com/question_details/271
Briefly describe features of MOLAP.
MOLAP, on the other hand, requires the data to be computed beforehand. The data is
stored in a multidimensional array. HOLAP is a hybrid form of OLAP that combines the best features of ROLAP and MOLAP.
Name the steps involved in cleaning of data.
The basic steps for cleaning data are as follows:
1) Import the data from an external data source.
2) Create a backup copy of the original data in a separate workbook.
3) Ensure that the data is in a tabular format of rows and columns with: similar data in each column, all columns and rows visible, and no blank rows within the range. For best results, use an Excel table.
4) Do tasks that don't require column manipulation first, such as spell-checking or using the Find and Replace dialog box.
5) Next, do tasks that do require column manipulation. The general steps for manipulating a column are as follows:
6) Add a formula that will transform the data at the top of the new column.
Ref: http://office.microsoft.com/en-us/excel-help/top-ten-ways-to-clean-your-data-HA010221840.aspx
Briefly describe the features of STAR schema.
A logical structure that has a fact table in the center with dimension tables radiating off of this central table. The fact table will generally be extremely large in comparison to its dimension tables. There will be a foreign key in the fact table for each dimension table.
Ref: Handouts Page No. 105
Briefly describe the multi-pass BSN (Basic Sorted Neighborhood) approach.
In the multi-pass approach, the sorted neighborhood method is executed several times, each pass using a different sort key and a relatively small window; the results of the independent passes are then combined by taking the transitive closure of the matched record pairs.
What are the benefits of HOLAP & DOLAP over MOLAP & ROLAP?
MOLAP
This is the traditional mode in OLAP analysis. In MOLAP, data is stored in the form of multidimensional cubes and not in relational databases. The advantages of this mode are that it provides excellent query performance and the cubes are built for fast data retrieval. All calculations are pre-generated when the cube is created and can be easily applied while querying data. The disadvantages of this model are that it can handle only a limited amount of data. Since all calculations have been pre-built when the cube was created, the cube cannot be derived from a large volume of data. This deficiency can be bypassed by including only summary-level calculations while constructing the cube. This model also requires huge additional investment as cube technology is proprietary and the knowledge base may not exist in the organization.
ROLAP
The underlying data in this model is stored in relational databases. Since the data is stored in relational databases, this model gives the appearance of traditional OLAP's slicing and dicing functionality. The advantages of this model are that it can handle a large amount of data and can leverage all the functionalities of the relational database. The disadvantages are that the performance is slow and that each ROLAP report is an SQL query with all the limitations of the genre. It is also limited by SQL functionalities. ROLAP vendors have tried to mitigate this problem by building into the tool out-of-the-box complex functions as well as providing the users with an ability to define their own functions.
HOLAP
HOLAP technology tries to combine the strengths of the above two models. For summary type information HOLAP leverages cube technology and for drilling down into details it uses the ROLAP model.
Comparing the use of MOLAP, HOLAP and ROLAP
The type of storage medium impacts on cube processing time, cube storage and cube browsing speed. Some of the factors that affect MOLAP storage are:
Cube browsing is the fastest when using MOLAP. This is so even in cases where no aggregations have been done. The data is stored in a compressed multidimensional format and can be accessed more quickly than in the relational database. Browsing is very slow in ROLAP and about the same in HOLAP as in MOLAP. Processing time is slower in ROLAP, especially at higher levels of aggregation.
MOLAP storage takes up more space than HOLAP as data is copied, and at very low levels of aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space as data is not duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP aggregations.
All data is stored in the cube in MOLAP and data can be viewed even when the original data source is not available. In ROLAP data cannot be viewed unless connected to the data source.
MOLAP can handle very limited data only as all data is stored in the cube.
What is the relationship between data quality and the value of a particular application when efforts for data quality are logical?
REF http://www.geekinterview.com/question_details/271
What is Data warehousing?
A data warehouse can be considered as a storage area where interest-specific or relevant data is stored irrespective of the source. What is actually required to create a data warehouse can be considered as data warehousing. Data warehousing merges data from multiple sources into an easy and complete form. Data warehousing is the process of building a repository of an organization's electronic data, which is used for reporting and analysis. The essential concept of data warehousing is to provide a data flow architecture from the operational systems to the decision support environments.
Ref: Handouts Page No. 10
Why Data Warehousing?
The world economy has moved from the industrial age into information driven
knowledge economy. The information age is characterized by the computer technology,
modern communication technology and Internet technology; all are popular in the world
today. Governments around the globe have realized the potential of information, as a "multifactor" in the development of their economy, which not only creates wealth for the society, but also affects the future of the country. Thus, many countries in the world have placed the modern information technology into their strategic plans. They regard it as the most important strategic resource for the development of their society, and are trying their best to reach and occupy the peak of the modern information-driven knowledge economy.
Ref: Handouts Page No. 10
What are fact tables and dimension tables?
Fact table in a data warehouse consists of facts and/or measures. The nature of data in a fact table is usually numerical. On the other hand, dimension table in a data warehouse contains fields used to describe the data in fact tables. A dimension table can provide additional and descriptive information (dimension) of the field of a fact table e.g. If I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my Dimension table will store the task and resource details. Hence, the relation between a fact and dimension table is one to many.
What is ETL process in data warehousing?
ETL is Extract, Transform, Load. It is a process of fetching data from different sources, converting the data into a consistent and clean form, and loading it into the data warehouse.
Ref: Handouts Page No. 129
What is an OLTP system?
OLTP: Online Transaction Processing helps and manages applications based on transactions involving high volumes of data. Typical examples of such transactions are commonly observed in banks, air ticketing, etc. Because OLTP uses client-server architecture, it supports transactions running across a network.
Ref: Handouts Page No. 30
What is an OLAP system?
OLAP: Online analytical processing performs analysis of business data and provides the ability to perform complex calculations on usually low volumes of data. OLAP helps the user gain an insight on the data coming from different sources (multi dimensional).
Ref: Handouts Page No. 69
What are cubes?
Multi dimensional data is logically represented by Cubes in data warehousing. The dimension and the data are represented by the edge and the body of the cube respectively. OLAP environments view the data in the form of hierarchical cube. A cube typically includes the aggregations that are needed for business intelligence queries.
Explain sequence clustering algorithm.
The sequence clustering algorithm collects similar or related paths and sequences of data containing events, e.g. the sequence clustering algorithm may help find the path to store products of a "similar" nature in a retail warehouse.
Ref: Handouts Page No. 164
Explain the difference between star and snowflake schemas.
Star schema: A highly de-normalized technique. A star schema has one fact table associated with numerous dimension tables, and depicts a star. Snowflake schema: A star schema to which normalization principles have been applied is known as a snowflake schema. A dimension table can be associated with sub-dimension tables, i.e. the dimension tables can be further broken down into sub-dimensions.
Differences: A dimension table will not have a parent table in a star schema, whereas snowflake
schemas have one or more parent tables. The dimension table itself consists of hierarchies of
dimensions in a star schema, whereas hierarchies are split into different tables in a snowflake schema. Drilling down from the topmost hierarchies to the lowermost hierarchies can be done in both.
Ref: Handouts Page No. 105
Define one-to-many transformation.
A one-to-many transformation is more complex than a scalar transformation, as a data
element from the source system results in several columns in the DW. Consider a 6×30
address field (6 lines of 30 characters each); the requirement is to parse it into street
address lines 1 and 2, city, state and zip code by applying a parsing algorithm.
Ref: Handouts Page No. 132
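A Python sketch of the parse described above: the fixed-width 6×30 address field is cut into 30-character lines and mapped to separate warehouse columns. The sample value and the line-to-column mapping are assumed for illustration.

raw = ("House 12, Street 5".ljust(30) + "Gulberg III".ljust(30) +
       "Lahore".ljust(30) + "Punjab".ljust(30) +
       "54000".ljust(30) + "".ljust(30))          # 6 lines of 30 characters each

lines = [raw[i:i + 30].strip() for i in range(0, 180, 30)]
address = {
    "address_line_1": lines[0],
    "address_line_2": lines[1],
    "city":  lines[2],
    "state": lines[3],
    "zip":   lines[4],
}
print(address["city"], address["zip"])   # Lahore 54000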
What is Data Cardinality?
Answer: Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation. There are 3 basic types of cardinality. High data cardinality: values of a data column are very uncommon, e.g. email ids and user names. Normal data cardinality: values of a data column are somewhat uncommon but never unique, e.g. a data column containing LAST_NAME (there may be several entries of the same last name). Low data cardinality: values of a data column are very common, e.g. flag statuses: 0/1. Determining data cardinality is a substantial aspect of data modeling; it is used to determine the relationships between tables.
What is Data Scheme?
Data Scheme is a diagrammatic representation that illustrates data structures and data relationships to each other in the relational database within the data warehouse. The data structures have their names defined with their data types. Data Schemes are handy guides for database and data warehouse implementation. The Data Scheme may or may not represent the real layout of the database, but just a structural representation of the physical database. Data Schemes are useful in troubleshooting databases.
What is data cube technology used for?
A data cube is a multi-dimensional structure. The data cube is a data abstraction to view aggregated data from a number of perspectives. The dimensions are aggregated as the 'measure' attribute, while the remaining dimensions are known as the 'feature' attributes. Data is viewed on a cube in a multidimensional manner. The aggregated and summarized facts of variables or attributes can be viewed. This is the requirement where OLAP plays a role.