Tuesday, July 8, 2014

Good Data Warehouse Links

These are places that I use as reference or research or just want to remember:


Mike 2.0 http://mike2.openmethodology.org/
MIKE2.0 which stands for Method for an Integrated Knowledge Environment, is an open source methodology for Enterprise Information Management that provides a framework for information development . The MIKE2.0 Methodology is part of the overall Open Methodology Framework.

Ralph Kimball: http://www.kimballgroup.com/
The Kimball Group is a focused team of senior consultants specializing in the design of effective data warehouses to deliver enhanced business intelligence. Through consulting, education, and writing, we help organizations leverage the information that’s collected by their operational systems to make better business decisions. We pioneered this industry. Although what we do is no longer unique, how we go about doing it most definitely is.

Bill Inmon: http://www.inmoncif.com/home/
Best known as the “Father of Data Warehousing”, Bill Inmon has become the most prolific and well-known author worldwide in the data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the “Ten IT People Who Mattered in the Last 40 Years” of the computer profession.

TOGAF: http://www.opengroup.org/togaf/
The Open Group is a global consortium that enables the achievement of business objectives through IT standards. With more than 400 member organizations, we have a diverse membership that spans all sectors of the IT community

Zachman Framwork: http://www.zachman.com/
John A. Zachman is the originator of the “Framework for Enterprise Architecture” (The Zachman Framework™) which has received broad acceptance around the world as an integrative framework, an ontology for descriptive representations for Enterprises. Mr. Zachman is not only known for this work on Enterprise Architecture, but is also known for his early contributions to IBM’s Information Strategy methodology (Business Systems Planning) as well as to their Executive team planning techniques (Intensive Planning).

Agile Data: http://agiledata.org/
The Agile Data (AD) method defines a collection of strategies that IT professionals can apply in a wide variety of situations to work together effectively on the data aspects of software systems.  This isn’t to say that AD is a “one size fits all” methodology.  Instead, consider AD as a collection of techniques and philosophies that will enable IT professionals within your organization to work together effectively when it comes to the data aspects of software-based systems.

Data Warehousing Review: http://www.dwreview.com/
Data Warehouses are increasingly used by enterprises to increase efficiency and competitiveness. Using Scorecarding, Data Mining and OLAP analysis, business value can be extracted from Data Warehouses. Generating positive return-on-investments (ROI) from Data Warehouses requires a blend of  business intuitiveness and technical skills. This web site covers strategies and technologies that will enhance the ROI of Data Warehousing.

Data Administration: http://www.tdan.com/
Welcome to TDAN.com, the industry leading publication for people interested in learning about data administration and data management disciplines & best practices.
http://www.tdan.com/view-articles/4873/

Project Management: http://www.projectmanagement.com/
To make project managers more successful. ProjectManagement.com is the experience bridge that fills in the gaps--providing help to project managers in a number of ways. It is a community, your community, for project managers in Information Technology and other industries.
http://www.projectmanagement.com/content/processes/9076.cfm

1 Key Data: http://www.1keydata.com/
1Keydata.com offers free online programming tutorials and information on data warehousing.
http://www.1keydata.com/datawarehousing/datawarehouse.html
http://www.1keydata.com/datawarehousing/recipes-for-failure.html

BLOG: SQL Authority with Pinal Dave: http://blog.sqlauthority.com/
One of the best blogs on SQL coding in SQL Server.

BLOG: Data Warehousing and Business Intelligence: http://dwbi1.wordpress.com/
Awesome Blog about Data Warehousing and Business Intelligence
http://dwbi1.wordpress.com/2010/06/16/data-architect-data-warehouse-architect-and-bi-solution-architect/

Database Answers: http://www.databaseanswers.org/
Specialise in Data Modelling and we enjoy helping people to understand the power, the beauty and the economy of well-designed ERD Data Models.

Information Management: http://www.information-management.com/ 
Information Management www.info-mgmt.com is the educated reader’s choice for the latest news, commentary and feature content serving the information technology and business community. With its 2009 relaunch, Information Management has now grown readership and market share through interactive features, in-depth case studies and incisive thought leadership to become the trusted source for the data management and content communities. Information Management provides original reporting, online radio programming, informative Web seminar programming, white paper resources and online education tuned to the working needs of today’s information professional.

Data Warehouse Identifying Stakeholders

This is probably one of the most critical factors in Data Warehousing to identify who the stakeholders are and when do you need input from them at what stages?

Most projects get dragged along because the stakeholders and what they role is in the project were not clearly defined. I have some basic Rules I apply when creating a stakeholder matrix:

A stakeholder is a person who has a direct or indirect stake in the organisation. From a project point of view, stakeholders can be categorised by various responsibilities (SMART RACI + Matrix (Hartman 2001), (Hartman and Ashrafi 2004)):

Accountable – some stakeholders are accountable for the success of the entire project or a particular phase of the project, generally they are the departmental or divisional managers or;

Responsible – some stakeholders are responsible for the deliverables generally they have been charged by management to provide a definitive outcome from the project; or
Consultation – some stakeholders are used as consultants like an Subject Matter Expert (SME); or
Informed – some stakeholders need to be kept informed, generally management who needs to know where their investment dollars have ended up or external parties.

In a Data Warehouse implementation, it is generally the stakeholders that have requested the need for decision support. Therefore it is assumed that stakeholders will be key supporters of the project as they have most likely been the ones who have identified the initial requirements for the Data Warehouse. Stakeholders require tangible outcomes that they can measure, trust and quantify consolidated data. This is important to stakeholder(s) as this will provide them with the ability to follow through with the decision making process (Westerman 2001).

A project team will generally consist of:
  • a Project Sponsor;
  • the steering committee;
  • the review board;
  • the project manager;
  • the implementation team(s);
  • end users;
  • quality control.
But one of the questions that always appears is what is the defining roles, in my years of experience these roles are usually needed and must be identified:
  • Source Data Owners
  • Information Owners
  • Process Owners
  • Marketing
  • Legal & Security
  • Data Quality
  • Architecture
  • Infrastructure Owners
  • Data and Information Consumers
These user roles will have to sign off certain sections and will have to take ownership of that section in order for a proper implementation.

Let's look a each role and when they play a part in the normal life cycle of the data warehouse:

Source Data Owners: They are always in Data Warehouse projects as they will provide the data and also when there is fixes that needs to be done they will implement it on the source side so that the fix are reflected correctly through the process. Without them the project will fail, if the data that is supplied is not from the original owners, you might just get a penalty fee or the data flow might stop. They are usually heads of system or processes in a organisation. They will be involved throughout the life cycle. Sign-off is required on all documents.

Information Owners: I have made information owners as a separate role as they translate the raw data from the source into more understandable decision support systems, they are identified usually by the business rules that are applied to the data and they will also own the business rules. They are key to the project because they need to approve all the translation and business rules that are applied on the source data. They too are heads of the MIS teams. Their major involvement will be in the Business Requirements cycle. Sign-off are required on the requirements and end results (UAT)

Process Owners: The process owner is responsible for designing the processes necessary to achieve the objectives of the business plans that are created by the Business Leaders. The process owner is responsible for the creation, update and approval of documents (procedures, work instructions/protocols) to support the process. Many process owners are supported by a process improvement team. The process owner uses this team as a mechanism to help create a high performance process. The process owner is the only person who has authority to make changes in the process and manages the entire process improvement cycle to ensure performance effectiveness. This person is the contact person for all information related to the process. This person is accountable for the effectiveness of the process. They play a role in where the data and information touches the process, sometimes documents become obsolete because the process get automated through the data and information. The process owners are involved  in the Requirement and Design of a project. Sign-off are required on Requirements and design.

Marketing: This is a interesting factor in a data warehouse, but in major corporation there is a corporate identity that governs how the look and feel will be. Because there are reports and dashboards that end users will see and even outside clients an see the reports, marketing must be involved in the requirement and report design, they will govern the colours, layout, graphs logos, fonts and sizes of all the reports. Sign-off is required in the Requirement and Design phase.

Legal & Security: Their roles will be involved if there are legal ramification in exposing the data, this includes disclaimers on reports and what data and information consumers will be able to see. In South Africa there is the POPI Act that acts on the Personal Information broadly means any information relating to an identifiable, living natural person or juristic person (companies, CC’s etc.) and includes, but is not limited to: contact details: email, telephone, address etc.; demographic information: age, sex, race, birth date, ethnicity etc.; history: employment, financial, educational, criminal, medical history; biometric information: blood type etc.; opinions of and about the person; private correspondence etc. Sign-off is required on the requirements and UAT phases

Data Quality: Data quality owners are a definition in bigger organisations where the quality of the data and information cannot be alone controlled by the source data and information owners. I have separated them as the quality of data are becoming more and more important in organisations, sometimes there are different source systems with the same data and thus data quality experts must ensure that the true and best source of data quality will come from that system, they are also involved in master data management and also they are beneficial to have if it comes to rules that needs to be verified in the translation of data into information. Sign-off is optional on Requirements.

Data Warehouse Architects: They are involved in the initial stages of projects and data model design stages, this role will govern the implementation of projects. They play key roles in data warehouses as they need to ensure that the data warehouse complies with laws, and best practices and also to ensure that the data warehouse will be able to consume the new data and that overall integration and re usability of code will be used in the process of data integration. A data architect in Information Technology is a person responsible for ensuring that the data assets of an organization are supported by a data architecture that aids the organization in achieving its strategic goals. The data architecture should cover databases,data integration and the means to get to the data. Usually the data architect achieves his/her goals via setting enterprise data standards. A Data Architect can also be referred to as a Data Modeler, although the role involves much more than just creating data models.The work profile is related to Data Modeler. Sign-off is required in the Project initiation and the Design phases.

Infrastructure Owners: Their role might fall in with the Data Warehouse Architect, but for clarification I will separate them. They will be involved in the operations of the data warehouse as they will provide the hardware and software on which the solutions will be supplied on. Their involvement in project are more with capacity planning of total data warehouses. Sign-off is optional in design, but they will need to be notified of the data coming into the infrastructure and requirements for future.

Data and Information Consumers: These are the end users that will act upon the data and information either as operational response or making decision on the data, their input is valuable in a project as they will show the success of a project by the use of the reports and/or dashboards and/or data. The project is initiated to answer their needs and questions and to help them do their jobs better. Sign-off is required at the requirements and UAT phases and final acceptance in production.

Now that I have highlighted some of the roles in a project lets have a look at some questions in the requirements.

How will you identify the data warehouse project stakeholders?

This should be done by listing methods such as conducting brainstorming sessions or interviews, process modeling, identifying use cases and creating and analyzing surveys and questionnaires. Data warehouse project stakeholders will include any person, group of people or department that may be affected by the project. For example, a data warehouse project stakeholder could include members of the data warehouse analysis team, front-line managers who request specific reports from the information housed in the data warehouse or the information system engineers who design data queries.

What methods will you use to elicit business and stakeholder requirements?

This should be done by discussing requirements gathering techniques such as document analysis, conducting requirements workshops or focus groups, observing or job-shadowing individuals who interact with the system and process data warehouse information requests regularly, and diagramming interface analysis. Business requirements can include efficiency and cost reduction; stakeholder requirements can include the ease of system usability.

What domain knowledge or experience do you have with a data warehouse?

Experience with or knowledge of the domain, or the problem area undergoing analysis, may or may not be a requirement for the job. Some data warehouse systems are proprietary, or specific to a particular organization. Knowledge of the particular data warehouse system can be gained through study of manuals and discussion with subject matter experts.

How would you manage change requests in a data warehouse project?

This should be done by discussing the steps involved in managing change requests after project requirements have been defined. Steps should include establishing the process and wording for requesting a change, determining who has authority to authorize changes and analyzing the impacts of the change on the project, including budget and effort, the benefits and risks associated with the change and how the change will be prioritized and tracked.


References
Hartman, F. a. and R. Ashrafi (2004). "Development of the SMART Project Planning framework." International Journal of Project Management 22(6): 499-510.

Hartman, F. T. (2001). "Don't Park Your Brain Outside - A Practical Guide to Improving Shareholder Value with SMART Management." International Journal of Project Management 19(8): 486-488.

Keren, G. (1990). Cognitive aids and debiasing methods: Can cognitive pills cure cognitive ills. Cognitive biases J. P. Caverni, J. M. Fabre and M. Gonzalez. Amsterdam, North-Holland, Elsevier Science Publishers: 523-555.

Sauser, B. J., R. R. Reilly, &, et al. (2009). "Why projects fail? How contingency theory can provide new insights – A comparative analysis of NASA’s Mars Climate Orbiter loss." International Journal of Project Management Available online 14 February 2009: 15.

Westerman, P. (2001). Data Warehousing: using the Wal-Mart model, Academic Press.

Yeo, K. T. a. and F. Qiu (2003). "The value of management flexibility—a real option approach to investment evaluation." International Journal of Project Management 21: 243–250.

Zwikael, O., K. Shimizu, and, et al. (2005). "Cultural differences in project management capabilities: A field study." International Journal of Project Management 23: 454–462.

Sunday, March 16, 2014

Data Types or Attributes

If you are in data warehousing there will come a time to to data projections on tables. The basic to see on SQL Server you can use the following query:

SELECT * FROM sys.types

from there you will get a list of the data types:

image, text, uniqueidentifier, date, time, datetime2, datetimeoffset, tinyint, smallint, int, smalldatetime, real, money, datetime, float, sql_variant, ntext, bit, decimal, numeric, smallmoney, bigint, hierarchyid, geometry, geography, varbinary, varchar, binary, char, timestamp, nvarchar, nchar,
xml, sysname

Character data types













Here are a few general rules that should help:

  • Don't use nchar or nvarchar unless you truly need it. (Unicode provides a unique number for up to 65,536 characters. ANSI, the one most of us are most familiar with, has only 256.) Unless you're working with an international application, you probably don't need a Unicode data type.
  • Use the smallest data type necessary, but make sure it can accommodate the largest possible value.
  • Use a fixed-length data type when the values are mostly about the same size.
  • Use a variable length when the values vary a lot in size.
 Integer data types










Assigning the appropriate integer data type isn't as confusing as choosing a character data type. Simply use the smallest integer data type that accommodates the largest possible value.

Precision storage requirements





 





Float and real data type restrictions








The real data type is the same as float(24) -- a floating data type with 24 digits to the right of the decimal point

Smalldatetime and datetime restrictions



 









Smallmoney and money restrictions



Sunday, December 22, 2013

Best Practice on Data Warehouse Environments

When growing in Data Warehousing on of the most asked questions by infrastructure is what is the best practice on how many environments must be running?

Some assumptions that is made that in the Medium and Big Environments there is also a DR (Disaster Recover) in place.

Each organisation is different in regards to budget and user needs the following is recommendations of most common environments:

Small / New Environment:




In a small or new environment it is best to have only the production server separated, as the cost are then kept at a minimum and UAT deployment can happen a lot quicker. When using this environment please ensure that the following is kept:

Job runs:  Jobs should be disabled in DEV/UAT and only run when user request, else is can slow down the whole environment.
Backups: Backups should only be run on the Development Projects (Daily) and Production (Daily)
Syncing: Once a project moves to Production the Development and UAT instances must be deleted from  the environment, if there is any changes the production project must be copied over and used.

Medium / Mature Environment:
















Job runs:  Jobs should be disabled in DEV. QA can have a 3 instance run to test while in UAT phase, after this is Must be disabled.
Backups: Backups should run on the DEV (Daily), UAT (Monthly) and Production (Daily)
Syncing: Once a project moves to Production the Development instances must be deleted from  the environment, if there is any changes the production project must be copied over and used, from here normal process follows, the UAT should not be deleted as this is use to test the changes does not have an project version impact.

Big / High Performance Environment:

This environment is used for critical Data Warehouse system that can have big impacts on business. They are also very expensive, but very stable in the long run.















Job runs:  Jobs should be disabled in DEV. QA can have a 3 instance run to test while in UAT phase, after this is Must be disabled. PRE-PROD should have the same run as PROD. This is to ensure environment as a whole testing impact before it goes into PROD.
Backups: Backups should run on the DEV (Daily), UAT (Monthly), PRE-PROD (Montly) and Production (Daily)
Syncing: Once a project moves to Production the Development instances must be deleted from  the environment, if there is any changes the production project must be copied over and used, from here normal process follows, the UAT should not be deleted as this is use to test the changes does not have an project version impact. The PRE-PROD is used for full environment impact.

Wednesday, November 27, 2013

Data Warehousing

When I started in 2007 (6 Years ago) on Data Warehousing a terabyte of data was a lot, I had no idea what was in stored. I was amazed when 6 years later we already at exabytes! It seems that there is a quiet revolution in progress.
What is data warehousing?

Processor or Virtual Storage

· 1 Bit = Binary Digit
· 8 Bits = 1 Byte
· 1024 Bytes = 1 Kilobyte
· 1024 Kilobytes = 1 Megabyte
· 1024 Megabytes = 1 Gigabyte
· 1024 Gigabytes = 1 Terabyte
· 1024 Terabytes = 1 Petabyte
· 1024 Petabytes = 1 Exabyte
· 1024 Exabytes = 1 Zettabyte
· 1024 Zettabytes = 1 Yottabyte
· 1024 Yottabytes = 1 Brontobyte
· 1024 Brontobytes = 1 Geopbyte· 1 Bit = Binary Digit

Disk Storage

· 8 Bits = 1 Byte
· 1000 Bytes = 1 Kilobyte
· 1000 Kilobytes = 1 Megabyte
· 1000 Megabytes = 1 Gigabyte
· 1000 Gigabytes = 1 Terabyte
· 1000 Terabytes = 1 Petabyte
· 1000 Petabytes = 1 Exabyte
· 1000 Exabytes = 1 Zettabyte
· 1000 Zettabytes = 1 Yottabyte
· 1000 Yottabytes = 1 Brontobyte
· 1000 Brontobytes = 1 Geopbyte



Bit

A bit is a value of either a 1 or 0 (on or off).

Nibble

A Nibble is 4 bits.

Byte

A Byte is 8 bits.

Kilobyte (KB)

A Kilobyte is 1,024 bytes.

Megabyte (MB)

A Megabyte is 1,048,576 bytes or 1,024 Kilobytes

873 pages of plaintext (1,200 characters)
4 books (200 pages or 240,000 characters)
Gigabyte (GB)

A Gigabyte is 1,073,741,824 (230) bytes. 1,024 Megabytes, or 1,048,576 Kilobytes.

894,784 pages of plaintext (1,200 characters)
4,473 books (200 pages or 240,000 characters)
341 digital pictures (with 3MB average file size)
256 MP3 audio files (with 4MB average file size)
1 650MB CD
Terabyte (TB)

A Terabyte is 1,099,511,627,776 (240) bytes, 1,024 Gigabytes, or 1,048,576 Megabytes.

916,259,689 pages of plaintext (1,200 characters)
4,581,298 books (200 pages or 240,000 characters)
349,525 digital pictures (with 3MB average file size)
262,144 MP3 audio files (with 4MB average file size)
1,613 650MB CD's
233 4.38GB DVD's
40 25GB Blu-ray discs
Petabyte (PB)

A Petabyte is 1,125,899,906,842,624 (250) bytes, 1,024 Terabytes, or 1,048,576 Gigabytes.

938,249,922,368 pages of plaintext (1,200 characters)
4,691,249,611 books (200 pages or 240,000 characters)
357,913,941 digital pictures (with 3MB average file size)
268,435,456 MP3 audio files (with 4MB average file size)
1,651,910 650MB CD's
239,400 4.38GB DVD's
41,943 25GB Blu-ray discs
Exabyte (EB)

A Exabyte is 1,152,921,504,606,846,976 (260) bytes, 1,024 Petabytes, or 1,048,576 Terabytes.

960,767,920,505,705 pages of plaintext (1,200 characters)
4,803,839,602,528 books (200 pages or 240,000 characters)
366,503,875,925 digital pictures (with 3MB average file size)
274,877,906,944 MP3 audio files (with 4MB average file size)
1,691,556,350 650MB CD's
245,146,535 4.38GB DVD's
42,949,672 25GB Blu-ray discs
Zettabyte (ZB)

A Zettabyte is 1,180,591,620,717,411,303,424 (270) bytes, 1,024 Exabytes, or 1,048,576 Petabytes.

983,826,350,597,842,752 pages of plaintext (1,200 characters)
4,919,131,752,989,213 books (200 pages or 240,000 characters)
375,299,968,947,541 digital pictures (with 3MB average file size)
281,474,976,710,656 MP3 audio files (with 4MB average file size)
1,732,153,702,834 650MB CD's
251,030,052,003 4.38GB DVD's
43,980,465,111 25GB Blu-ray discs
Yottabyte (YB)

A Yottabyte is 1,208,925,819,614,629,174,706,176 (280) bytes, 1,024 Zettabytes, or 1,048,576 Exabytes.

1,007,438,183,012,190,978,921 pages of plaintext (1,200 characters)
5,037,190,915,060,954,894 books (200 pages or 240,000 characters)
384,307,168,202,282,325 digital pictures (with 3MB average file size)
288,230,376,151,711,744 MP3 audio files (with 4MB average file size)
1,773,725,391,702,841 650MB CD's
257,054,773,251,740 4.38GB DVD's
45,035,996,273,704 25GB Blu-ray discs

What is Data Warehousing


Like the normal warehousing terminology data warehousing is the storing of data generated by businesses or computers. Wikipedia defines it as follows:

"A data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons."

"A data mart is a small data warehouse focused on a specific area of interest. Data warehouses can be subdivided into data marts for improved performance and ease of use within that area. Alternatively, an organization can create one or more data marts as first steps towards a larger and more complex enterprise data warehouse."

Data Warehousing is not just focusing on the storage of data but also the integration, accumulation, distribution and analysis of data. Each of these is a specialization on its own. (Data Integration / ETL , Data Modeling, Data storage / Database, Business Intelligence (Decision Support Systems) & Data Visualization )

The 2 fathers of data warehousing is Bill Inmon & Ralph Kimball.

Ralph KimballBill Inmon
Build business process oriented small data marts which are joined to each other using common dimensions between business process.One centralize data warehouse which will act as a enterprise-wide data warehouse and then build data mart as per need for specific department or process
It is known as bottom-up approachIt is known as top down approach
Data marts should be build on dimensional modelling approachCentral data warehouse to follow ER modelling approach


It is recommended to know how both of them work and design a solution for the business using the one that will suit the needs the best. I find merit in both of these two men philosophies, and used them in many DW solutions. Industry standard models usually follows Inmon's methodology, where Kimball's works best for Business Intelligence reporting.