various publications of foreign governments or of international organisations and their subsidiary organisations; technical and trade journals; books, magazines, and newspapers; and reports and publications of various associations connected with business and industry, banks, stock exchanges, and other financial institutions. Unpublished data can be found in a variety of places, including diaries, letters, and unpublished biographies and autobiographies, as well as with scholars and research workers, trade associations, labour bureaus, and other public and private individuals and organisations.

When using secondary data, the researcher must exercise extreme caution. He must make a thorough examination of the secondary data, because it is possible that the data are unsuitable or inadequate in the context of the problem the researcher wishes to investigate. Dr. A.L. Bowley makes an excellent point in this regard: it is never safe to take published statistics at face value without understanding their meaning and limitations, and it is always necessary to criticise arguments that may be based on them. To be on the safe side, the researcher should ensure that secondary data possess the following characteristics before using them:

1. Reliability of data: To determine the reliability of the data, it is necessary to ask:
• Who collected the data?
• What were the data's sources of information?
• Were they gathered in accordance with proper procedures?
• At what time were they collected?
• Did the compiler have any preconceived notions?
• What level of accuracy was desired? Was it achieved?

2. Suitability of data: Data that are suitable for one enquiry may not necessarily be suitable for another. As a result, if the researcher finds the available data unsuitable, he should refrain from using them.
In this context, the researcher must scrutinise with great care the definitions of the various terms and units of collection used at the time the data were collected from the primary source. In a similar vein, the object, scope, and nature of the original investigation must be examined. If the researcher discovers discrepancies in any of these, the data will be deemed unsuitable for the current investigation and should not be utilised.
3. Adequacy of data: If the level of accuracy achieved in the data is found to be insufficient for the purposes of the current investigation, the data will be deemed inadequate and should not be used. Data will also be considered inadequate if they relate to a geographical area that is either narrower or wider than the area under consideration.

From all of this, we may conclude that using already available data can be risky. The researcher should use previously collected data only when he is satisfied that they are reliable, suitable, and adequate. He should not, however, dismiss such data out of hand when they are readily available from reliable sources and are also suitable and adequate, for in that case it would not be cost-effective to spend time and energy on field surveys to gather the same information. Already available data may contain a wealth of useful information, which an intelligent researcher should utilise, though only after taking the necessary precautions.

11.10 SELECTION OF APPROPRIATE METHOD FOR DATA COLLECTION

There are, then, many different methods of data collection. The researcher must therefore carefully choose the method or methods for his or her own study, taking into consideration the following factors:

1. Nature, scope and object of enquiry: First and foremost, the nature, scope, and object of the enquiry must be considered when deciding on a particular method. To be effective, the method chosen must be appropriate for the type of investigation the researcher will conduct. This factor is also decisive in choosing between data that have already been collected (secondary data) and data yet to be collected (primary data).

2. Availability of funds: The funds available for the research project have a significant impact on the method of data collection.
When a researcher's financial resources are severely limited, he may be forced to choose a cheaper method even though it is less efficient and effective than a more expensive alternative. In practice, finance is a significant constraint, and the researcher must work within the limits of his or her budget.

3. Time factor: A third consideration is the availability of time, which must be taken into account when selecting a particular method of data collection. Some methods require a significant
amount of time, whereas others allow data to be collected in a much shorter period. The time available to the researcher therefore influences the method chosen.

4. Precision required: Another important factor is the level of precision required. The most desirable method is thus determined by the nature of the particular problem, the time and resources (money and personnel) available, and the desired level of precision. Above all, however, the researcher's own ability and experience are critical. It is very apt in this context to quote Dr. A.L. Bowley, who says that in the collection of statistical data “common sense is the most important requirement, and experience is the most important teacher”.

11.11 CASE STUDY METHOD

11.11.1 Meaning

An extremely popular form of qualitative analysis, the case study method entails the meticulous and comprehensive observation of a social unit, which may be anything from a single individual or family to an institution, a cultural group, or even an entire community. It is a method of investigation focused on depth rather than breadth. The case study places greater emphasis on the comprehensive examination of a limited number of events or conditions and their interrelationships, examining the processes that take place as well as how they relate to one another. A case study is therefore fundamentally an in-depth investigation of the particular unit under consideration. The goal of the case study method is to identify the factors responsible for the behaviour patterns of the given unit as an integrated whole, rather than as individual components. According to H.
Odum, “The case study method is a technique by which individual factor, whether it be an institution or just an episode in the life of an individual or a group, is analysed in its relationship to any other in the group.” In this way, a fairly exhaustive study of a person (what he does and has done, what he thinks he does and has done, what he expects to do and says he ought to do) or of a group is referred to as a “life history” or “case history”. Burgess has referred to the case study method as “the social microscope”. A case study, according to Pauline V. Young, is “an in-depth investigation of a social unit, whether that unit is a single individual, a group, a social institution, a district, or a community.” Generally speaking, the case study method is a form of qualitative analysis in which careful and complete observation of an individual, a situation, or an institution is carried out; efforts
are made to study every aspect of the unit under observation in minute detail, and generalisations and inferences are then drawn from the case data.

11.11.2 Characteristics

The following are the most important characteristics of the case study method:

1. Under this method, the researcher can take a single social unit or a group of social units for his or her research purposes; he or she may even select a situation to investigate in depth.

2. The selected unit is studied intensively, that is, down to the smallest detail possible. In general, the study extends over a long period of time in order to ascertain the natural history of the unit and to gather enough information for accurate inferences to be drawn.

3. Under this method, we make a complete study of the social unit, covering all of its aspects. We thereby aim to understand the complex of factors at work within the social unit as an integrated whole.

4. The approach under this method is qualitative rather than quantitative. Mere quantitative information is not collected; every possible effort is made to gather information concerning all aspects of life. Case studies thus broaden our perception and give us a clearer insight into life. For example, in a case study of a criminal we would look not only into how many crimes the man has committed but also into the factors that compelled him to commit them. The study's goal may be to make recommendations for reforming the criminal justice system.

5. Under the case study method, an effort is made to understand the mutual interrelationship of causal factors and how they interact with one another.

6.
The case study method examines the behaviour pattern of the unit in question directly, rather than through the indirect and abstract approach common to other methods.

7. The case study method produces fruitful hypotheses, along with data that may be useful in testing them, and thus enables generalised knowledge to grow richer and richer. In its absence, generalised social science may suffer.

11.11.3 Evolution and scope

The case study method is a systematic field research technique in sociology that is increasingly popular these days. Frederic Le Play is credited with introducing this method to the field of social investigation; he used it as a handmaiden
to statistics in his studies of family budgets. Herbert Spencer was the first to use case material, in his comparative studies of different cultures. Dr. William Healy used this method in his research on juvenile delinquency and considered it superior to the use of statistical data alone. Like anthropologists and historians, novelists and dramatists have employed the technique when addressing issues in their respective fields. Even seasoned management professionals use case study methods to solve management problems. Briefly stated, the case study method is used in a variety of disciplines, and its popularity is growing by the day.

11.11.4 Assumptions

The case study method rests on several assumptions. The most significant are:

1. The assumption of uniformity in fundamental human nature, despite the fact that human behaviour can vary depending on the situation.

2. The assumption that the unit under consideration is studied through its natural history.

3. The assumption that the unit under consideration is studied comprehensively.

11.11.5 Major phases involved

The following are the major phases involved in a case study:

1. Recognising and determining the current status of the phenomenon under investigation or the unit of attention.

2. Collecting data, examining the phenomenon, and studying its history.

3. Diagnosing and identifying causal factors as a basis for remedial or developmental treatment.

4. Applying remedial measures, i.e., treatment and therapy (this phase is often characterised as case work).

5. Following up to assess the effectiveness of the treatment applied.
11.11.6 Advantages

Several advantages of the case study method follow from the characteristics outlined above. The important ones may be mentioned here.

Being an exhaustive study of a social unit, the case study method enables us to understand fully the behaviour pattern of the unit concerned.

Through case studies a researcher can obtain an accurate and insightful record of personal experiences which reveals man's inner strivings and conflicts as well
as the motivations and forces that impel him to action and direct him to adopt a certain pattern of behaviour.

This method enables the researcher to trace the natural history of a social unit and its relationship to the social factors and forces at play in its surrounding environment.

It assists in the formulation of relevant hypotheses, along with the collection of data that may be helpful in testing them. Case studies thus enable generalised knowledge to become increasingly rich and diverse.

The method permits a far deeper study of social units than would otherwise be possible with the observation method or the method of collecting information through schedules. It is for this reason that the case study method is so frequently employed, particularly in social research.

The information gathered through the case study method is also extremely helpful to the researcher in constructing an appropriate questionnaire or schedule, a task that requires thorough knowledge of the relevant universe.

Depending on the circumstances, the researcher can employ one or more of several research methods under the case study method: in-depth interviews, questionnaires, documents, individual study reports, letters, and similar tools can all be used to gather information.

The case study method has proven extremely useful in determining the nature of the units to be studied as well as the nature of the universe. For this reason it is sometimes described as a “mode of organising data”.
Because of the emphasis this method places on historical analysis, it is an excellent tool for understanding the past of a social unit. It is also a technique for suggesting measures for improvement in the context of the present environment of the social units concerned.

Case studies constitute the ideal type of sociological material, as they represent a real record of personal experiences which very often escape the attention of most skilled researchers using other techniques.

The case study method enhances the researcher's hands-on experience and thereby improves his or her ability to analyse and to solve problems.

This method makes possible the study of social change. By examining the various aspects of a social unit in detail, the researcher can gain a clear understanding of social change past and present. This also
makes it easier to draw inferences and keeps the research process on track. In fact, the case study may be considered both the entry point to and the final destination of abstract knowledge.

Techniques such as case study analysis are essential for both therapeutic and administrative purposes. They are also extremely valuable in making decisions on a variety of management issues. Case studies are very useful for diagnosis and therapy, and for solving other practical case problems.

11.11.7 Limitations

The important limitations of the case study method may also be highlighted.

Case situations are seldom comparable, and as such the information gathered in case studies is often not comparable. Since the subject under case study tells his history in his own words, logical concepts and units of scientific classification have to be read into it or out of it by the investigator.

Read Bain does not consider case data to be significant scientific data, since they fail to provide knowledge of the “impersonal, universal, non-ethical, non-practical, repetitive aspects of phenomena.”

Real information is often not collected, because the subjectivity of the researcher enters into the collection of information in a case study.

The danger of false generalisation is always present, given that no set rules are followed in the collection of the information and only a few units are studied.

The method consumes more time and requires greater expenditure. More time is needed under the case study method because the natural history cycles of social units must be studied, and studied minutely.

Case data are often vitiated because the subject, according to Read Bain, may write what he thinks the investigator wants; and the greater the rapport, the more subjective the whole process becomes.
The case study method is based on several assumptions which may not be very realistic at times, and as such the usefulness of case data is always open to doubt.

The case study method can be used only in a limited sphere; it is not possible to apply it to a big society. Sampling is also not possible under the case study method.

The response of the investigator is an important limitation of the case study method. He often comes to believe that he has full knowledge of the unit and can himself answer for it; when this is not true, erroneous conclusions follow. In fact, this is more the fault of the researcher than of the case method itself.
11.12 SUMMARY

Data collection is the process of gathering and measuring information on variables of interest in a systematic manner that allows one to answer stated research questions, test hypotheses, and assess outcomes.

There are several methods of collecting primary data, particularly in surveys and descriptive research. Important ones are: (i) the observation method, (ii) the interview method, (iii) questionnaires, (iv) schedules, and (v) other methods, which include warranty cards, distributor audits, pantry audits, consumer panels, mechanical devices, projective techniques, depth interviews, and content analysis.

The interview method of data collection entails the presentation of oral-verbal stimuli, followed by a response expressed in terms of verbal responses. This method can be implemented through personal interviews and, where possible, telephone interviews.

A questionnaire is a collection of questions that are printed or typed in a specific order on a form or set of forms and distributed to respondents.

Secondary data refers to information that has already been collected and analysed by another party, i.e., information that has already been made available. Published data can generally be found in the following places: various publications of the central, state, and local governments; various publications of foreign governments or of international organisations and their subsidiary organisations; technical and trade journals; books, magazines, and newspapers; and reports and publications of various associations connected with business and industry, banks, stock exchanges, and other financial institutions. Unpublished data can be found in a variety of places, including diaries, letters, and unpublished biographies and autobiographies, as well as with scholars and research workers, trade associations, labour bureaus, and other public and private individuals and organisations.
11.13 KEYWORDS

Data collection: It is the process of gathering and measuring information on variables of interest in a systematic manner that allows one to answer stated research questions, test hypotheses, and assess outcomes.
Primary data: Primary data are those that are collected from scratch and for the first time, and are therefore considered to be unique in nature.

Structured observation: Structured observation is an observation characterised by a careful definition of the units to be observed, a style of recording the observed information, standardised conditions of observation, and the selection of pertinent data of observation.

Personal interview: When using the personal interview method, a person known as the interviewer asks questions of the other person or persons while in direct contact with them.

Telephone interview: Under this method, respondents are contacted directly by telephone in order to collect information.

Warranty card: Warranty cards, typically postal-sized cards, are used by dealers of consumer durables to collect information about their products, such as product specifications.

11.14 LEARNING ACTIVITY

1. “It is never safe to take published statistics at their face value without knowing their meaning and limitations.” Elucidate this statement by enumerating and explaining the various points you would consider before using any published data. Illustrate your answer with examples wherever possible.

11.15 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. Write short notes on data collection.
2. List the major classifications of data collection.
3. Define the interview method.
4. Write about the observation method.
5. What are the essentials of a good questionnaire?

Long Questions
1. Describe the methods of data collection.
2. How do you select appropriate data collection methods?
3. Explain data collection through questionnaires.
4. State the role of the interview method in collecting data.
5. Distinguish between questionnaires and schedules.

B. Multiple Choice Questions
1. ______ of data collection entails the presentation of oral-verbal stimuli, followed by a response expressed in terms of verbal responses.
a. Personal observation
b. Interview method
c. Survey method
d. Case study method

2. ______ is concerned with the broad underlying feelings or motivations of the individual, as well as the course of the individual's life experience, during the interview.
a. Personal Interview
b. Unstructured Interview
c. Clinical Interview
d. Non-directive Interview

3. These are typically postal-sized cards used by dealers of consumer durables to collect information about their products, such as product specifications.
a. Warranty cards
b. Guarantee cards
c. Credit cards
d. Debit cards

4. These are interviews designed to uncover underlying motives and desires.
a. Stress Interviews
b. Depth Interviews
c. Personal Interviews
d. Telephonic Interviews

5. ______ is a collection of questions that are printed or typed in a specific order on a form or set of forms and distributed to respondents.
a. Questionnaire
b. Test
c. Analysis sheet
d. All of these

Answers
1-b, 2-c, 3-a, 4-b, 5-a
11.16 REFERENCES

Reference books
R1. Business Research Methods – Alan Bryman & Emma Bell, Oxford University Press.
R2. Research Methodology – C.R. Kothari.
R3. Statistics for Managers Using Microsoft Excel – Levine, Stephan, Krehbiel, Berenson.

Textbook references
T1. SPSS Explained, ISBN: 9780415274104, Publisher: Tata McGraw Hill.
T2. Sancheti & Kapoor, Business Mathematics, Sultan Chand, New Delhi.
UNIT 12: SAMPLING AND TYPES OF SAMPLING

STRUCTURE
12.0 Learning Objectives
12.1 Meaning of Sample Design
12.2 Steps in Sample Design
12.3 Criteria for Selecting a Sampling Procedure
12.4 Characteristics of a Good Sample Design
12.5 Different Types of Sample Designs
12.5.1 Non-probability Sampling
12.5.2 Probability Sampling
12.6 How to Select a Random Sample?
12.7 Random Sample from an Infinite Universe
12.8 Complex Random Sampling Designs
12.8.1 Systematic Sampling
12.8.2 Stratified Sampling
12.8.3 Cluster Sampling
12.8.4 Area Sampling
12.8.5 Multi-stage Sampling
12.8.6 Sampling with Probability Proportional to Size
12.8.7 Sequential Sampling
12.9 Summary
12.10 Keywords
12.11 Learning Activity
12.12 Unit End Questions
12.13 References

12.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
Describe the steps in sample design.
Identify the criteria for selecting a sampling procedure.
Explain how to select a random sample.
Explain the different types of complex random sampling designs.
Describe the different types of sample designs.

12.1 MEANING OF SAMPLE DESIGN

A sample design is a predetermined strategy for selecting a representative sample from a given population. It refers to the technique or procedure the researcher would use to select the items for inclusion in the sample. The sample design may also specify the number of items to be included in the sample, referred to as the sample size. The sample design is determined before any data are collected. There are numerous sample designs from which a researcher can choose. Some designs are more precise and easier to apply than others. It is the researcher's responsibility to select or prepare a sample design that is reliable and appropriate for his research study.

12.2 STEPS IN SAMPLE DESIGN

In developing a sampling design, the researcher must keep the following considerations in mind:

(i) Type of universe: The first step in developing any sample design is to clearly define the collection of objects to be studied, technically referred to as the universe. The universe can be either finite or infinite. In a finite universe the total number of items is known, whereas in an infinite universe it is not; in other words, we have no way of knowing how many items there are in total. The population of a city and the number of workers in a factory are examples of finite universes, whereas the number of stars in the sky, the number of listeners to a specific radio programme, and the throws of a dice are examples of infinite universes.
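The distinction between a finite and an infinite universe can be made concrete with a small sketch. The frames and names below are hypothetical, used purely for illustration: from a finite universe we can enumerate the frame and draw a simple random sample directly, while from an effectively infinite universe (a stream whose total size is unknown) a technique such as reservoir sampling keeps every item's chance of selection equal without ever knowing the total count.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Finite universe: the frame can be fully enumerated, so a simple
# random sample can be drawn directly from it.
factory_workers = [f"worker_{i}" for i in range(1, 501)]  # hypothetical frame
sample = random.sample(factory_workers, k=20)


# Infinite (or unknown-size) universe: items arrive as a stream whose
# total count is unknown, so we keep a "reservoir" of k items and
# replace entries with decreasing probability as more items arrive.
def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = random.randint(0, i)         # each item kept with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

listeners = (f"listener_{i}" for i in range(10_000))  # stand-in for an endless stream
print(reservoir_sample(listeners, 5))
```

The key point is that the first approach requires a complete source list (sampling frame), which only a finite universe can supply; the second never needs to know the universe's size at all.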
(ii) Sampling unit: The sampling unit must be decided before the sample is selected. The sampling unit may be a geographical unit such as a state, district, or village; a construction unit such as a house or flat; a social unit such as a family, club, or school; or an individual. The researcher will have to decide on one or more of such units for his or her study.

(iii) Source list:
It is also referred to as the “sampling frame”, from which the sample is to be drawn. It contains the names of all the items of a universe (in the case of a finite universe only). If a source list is not available, the researcher has to prepare one. The list should be comprehensive, accurate, reliable, and appropriate. It is extremely important that the source list be as representative of the population as possible.

(iv) Size of sample: This refers to the number of items to be selected from the universe to constitute a sample. This is a major problem a researcher must address. The sample should be neither excessively large nor too small; it should be optimal. An optimal sample is one that fulfils the requirements of efficiency, representativeness, reliability, and flexibility. In determining the size of the sample, the researcher must take into consideration the desired precision as well as an acceptable level of confidence in the estimate. The size of the population variance must be considered because, with a larger variance, a larger sample size is usually required. The size of the population itself must be kept in mind, as this also limits the sample size. The parameters of interest in the research study must likewise be kept in view. Finally, costs dictate the size of the sample we can draw; budgetary constraints must therefore always be kept in mind when determining the sample size.

(v) Parameters of interest: In determining the sample design, one must consider the question of which specific population parameters are of interest to the researcher.
For example, we might be interested in estimating the proportion of people in the population who have a particular characteristic, or in knowing some average or other measure of the population. There may also be important sub-groups within the population about which we would like to make estimates. All of this strongly influences the sample design we would accept. In practice, budgetary constraints also have an enormous impact on decisions regarding not only the size of the sample but also the type of sample used; this fact can even lead to the use of a non-probability sample.

(vi) Sampling procedure: Finally, the researcher must decide the type of sample he will use, that is, the technique to be used in selecting the items for the sample. In fact, this technique or procedure stands for the sample design itself. There are several sample designs (explained in greater detail in the following pages) from which the researcher must select one for his or her study. It goes
without saying that he must select the design which, for a given sample size and for a given cost, has the smallest sampling error.

12.3 CRITERIA FOR SELECTING A SAMPLING PROCEDURE

In this context, it is important to remember that a sampling analysis involves two costs: the cost of collecting the data and the cost of making an incorrect inference from the data collected. Researchers must recognise the two main causes of incorrect inferences: systematic bias and sampling error. A systematic bias results from errors in the sampling procedures, and it cannot be reduced or eliminated by increasing the sample size. At best, the causes of these errors can be identified and corrected. Usually, a systematic bias is the result of one or more of the following factors:

1. Inappropriate sampling frame: An inappropriate sampling frame, that is, a biased representation of the universe, will result in a systematic bias.

2. Defective measuring device: If the measuring device is constantly in error, systematic bias will result. In survey work, systematic bias can occur if the questionnaire or the interviewer is biased. Similarly, if a physical measuring device is defective, the data collected through it will be subject to systematic bias.

3. Non-respondents: If we are unable to sample all of the individuals initially included in the sample, we may encounter a systematic bias. In such a situation, the likelihood of establishing contact with, or receiving a response from, a particular individual is often correlated with the measure of what is to be estimated.

4.
The principle of indeterminacy: It has been observed that individuals behave differently when they know they are under observation. For example, if workers are aware that they are being observed during a work study, on the basis of which the average time to complete a task will be determined and the piece-work quota set accordingly, they will generally work more slowly than they would if unobserved. The indeterminacy principle can therefore also be a source of systematic bias. 5. Natural bias in the reporting of data:
The natural bias of respondents in reporting data is frequently the source of systematic bias in many investigations. Income data collected by the government's taxation department is typically understated, whereas income data collected by some social organisations is typically overstated: people in general understate their incomes when asked for tax purposes, but overstate them when questions touch their social status or affluence. Similarly, people often respond to psychological surveys with what they believe to be the "correct" answer rather than revealing their true feelings. Sampling errors, in contrast, are the random variations in sample estimates around the true population parameters. Because they occur at random and are equally likely in either direction, they are compensatory in nature, and their expected value is zero. Sampling error decreases as the sample size increases, and it is of smaller magnitude for a homogeneous population. Sampling error can be computed for a given sample design and size, and its measurement is usually called the "precision of the sampling plan". Increasing the sample size improves precision, but a larger sample has its own limitations: it raises the cost of data collection and also increases the scope for systematic bias. As a result, the most effective way to increase precision is usually to select a better sampling design, one with a smaller sampling error for a given sample size at a given cost.
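The claim that sampling error shrinks as the sample size grows can be checked with a short simulation (a sketch only; the population of income values here is invented purely for illustration):

```python
import random
import statistics

random.seed(42)
# Hypothetical population of 10,000 income values (illustrative only).
population = [random.gauss(50_000, 12_000) for _ in range(10_000)]

# For several sample sizes, draw many samples and see how widely the
# sample means scatter around one another (the sampling error).
for n in (25, 100, 400):
    means = [statistics.mean(random.sample(population, n))
             for _ in range(500)]
    print(f"n = {n:3d}: std. error of the mean ~ {statistics.stdev(means):,.0f}")
```

Quadrupling the sample size roughly halves the scatter of the sample means, in line with the compensatory nature of sampling errors described above.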
In practice, however, people often prefer a less precise design because it is easier to implement and because systematic bias can be controlled more effectively in it. In brief, while selecting a sampling procedure, researchers must ensure that the procedure causes a relatively small sampling error and helps to control systematic bias in a better way. 12.4 CHARACTERISTICS OF A GOOD SAMPLE DESIGN From what has been stated above, we can list the following characteristics of a good sample design: The sample design must result in a sample that is truly representative of the population. The sample design must be such that the sampling error is as small as possible. The sample design must be viable in the context of the funds available for the research study. The sample design must be such that systematic bias can be controlled in a better way.
Finally, the sample size should be chosen so that the results of the sample study can be applied, with a reasonable level of confidence, to the universe in general. 12.5 DIFFERENT TYPES OF SAMPLE DESIGNS Sample designs differ according to two factors, viz., the representation basis and the element selection technique. On the representation basis, the sample may be a probability sample or a non-probability sample. Probability sampling is based on the concept of random selection, whereas non-probability sampling is 'non-random' sampling. On the element selection basis, the sample may be either unrestricted or restricted. When each sample element is drawn individually from the population at large, the sample so drawn is known as an 'unrestricted sample'; all other forms of sampling are covered by the term 'restricted sampling'. The chart below exhibits these sample designs. Thus, sample designs are basically of two types, viz., non-probability sampling and probability sampling. We take up these two designs separately. Figure 12.1: Chart showing basic sample designs 12.5.1 Non-probability sampling: Non-probability sampling is a sampling procedure that does not provide a basis for estimating the probability that each item in the population will be included in the sample. It is also known by a variety of other names, including deliberate sampling, purposive sampling, and judgement sampling. In this type of sampling, the items for the sample are deliberately chosen by the researcher, and his decision on which items to include remains final. In other words, under non-probability sampling the investigators purposefully choose the particular units of the universe that will form the sample, on the assumption that the
small mass they choose out of a large one will be representative of the whole. For example, if the economic conditions of people living in a state are to be investigated, a few towns and villages may be purposely selected for intensive study on the assumption that they are representative of the entire state. The judgement of the organisers of the study is thus critical in this sampling design. A personal element is very likely to enter the sample selection process under this design: the investigator may select a sample that produces results favourable to his point of view, and if that happens, the entire inquiry may be vitiated. Thus, there is always the danger of bias creeping into this type of sampling technique. Nevertheless, if the investigators are impartial, work without bias, and have the necessary experience to make sound judgements, the results obtained from an analysis of a deliberately selected sample may be tolerably reliable. Even so, such a sampling procedure cannot guarantee that every element has some known chance of being included in the sample; the sampling error cannot be estimated, and an element of bias, however small, is always present. This sampling design is therefore rarely adopted in large, important inquiries, though it may be used in small inquiries and research projects by individuals because of the savings in time and money it offers. Quota sampling is another example of non-probability sampling: interviewers are simply given quotas to be filled from different strata, with some restrictions on how they are to be filled.
In other words, the actual selection of the items for the sample is left to the interviewer's discretion. This type of sampling is very convenient and relatively inexpensive, but the samples so chosen do not possess the characteristics of random samples. Quota samples are essentially judgement samples, and the inferences drawn from them are not amenable to formal statistical treatment. 12.5.2 Probability sampling: Probability sampling is also known as 'random sampling' or simply 'chance sampling'. Under this design, each and every item in the universe has an equal chance of being included in the sample. It is, in a sense, a lottery method in which individual units are picked from the whole group by a mechanical process rather than deliberate choice; the selection of one item over another depends on chance alone. The results obtained from probability or random sampling can be assured in terms of probability: we can measure the errors of estimation, or the significance of results obtained from a random sample, and this fact demonstrates the superiority of random sampling over deliberate sampling. Random sampling also satisfies the law of Statistical Regularity, which states that if, on average, the sample chosen is a random sample, the sample will have the same
composition and characteristics as the universe as a whole. This is why random sampling is considered the most effective technique for selecting a representative sample. With regard to sampling from a finite population, random sampling refers to a technique in which each possible sample combination has an equal chance of being chosen and each item in the entire population has an equal chance of being included in the sample. This description applies to sampling without replacement, i.e., once an item is selected for the sample, it cannot appear in the sample again. (Sampling with replacement, which is used less frequently, returns the selected element to the population before the next element is drawn; the same element could then appear more than once in the same sample.) In brief, the implications of random sampling (or simple random sampling) are: (i) each element of the population has an equal chance of being included in the sample, and all choices are independent of one another; (ii) each possible sample combination has an equal probability of being chosen. We can define a simple random sample from a finite population as a sample chosen in such a way that each of the NCn possible samples has the same probability, 1/NCn, of being selected. To make this concrete, consider a finite population of six elements (say a, b, c, d, e, and f), i.e., N = 6, and suppose we wish to draw a sample of size n = 3 from it.
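With N = 6 and n = 3, the possible samples are easy to enumerate with Python's standard library (a quick sketch):

```python
from itertools import combinations

population = ['a', 'b', 'c', 'd', 'e', 'f']      # N = 6
samples = list(combinations(population, 3))      # every possible sample, n = 3

print(len(samples))       # 20 distinct samples, i.e. 6C3
print(1 / len(samples))   # 0.05, the 1/20 chance each sample must receive
```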
There are then 6C3 = 20 distinct possible samples of the required size: abc, abd, abe, abf, acd, ace, acf, ade, adf, aef, bcd, bce, bcf, bde, bdf, bef, cde, cdf, cef, and def. If we choose one of these samples in such a way that each has a 1/20 chance of being chosen, we call it a random sample. 12.6 HOW TO SELECT A RANDOM SAMPLE? As for how a random sample is taken in practice, for simple cases such as the one described above, we could write each of the possible samples on a slip of paper, mix the slips thoroughly in a container, and then draw as in a lottery, either blindfolded or by rotating a drum or using some similar device. For complex sampling problems, such a procedure is obviously impractical, if not impossible, so its practical utility is extremely limited. Fortunately, we can take a random sample in a relatively easier way, without the trouble of enumerating all possible samples on slips of paper. Instead, we can write on a slip of paper the name of each element of
the finite population, put the slips into a box or bag and mix them thoroughly, and then draw (without looking) the required number of slips one after another, without replacing the slips already drawn. We must ensure that in each successive drawing, every remaining element of the population has the same chance of being selected. This procedure gives each possible sample the same probability as the earlier one. Using the previous example: given a finite population of six elements from which three are to be selected, the probability of drawing any one of the sample elements in the first draw is 3/6, the probability of drawing a second in the second draw is 2/5 (the first element drawn is not replaced), and the probability of drawing the third in the third draw is 1/4. The combined probability of the three elements making up our sample is the product of these probabilities, 3/6 × 2/5 × 1/4 = 1/20, which confirms our earlier calculation. In actual practice, even this relatively simple method of obtaining a random sample can be made simpler still by using tables of random numbers. Various statisticians, including Tippett, Yates, and Fisher, have constructed such tables, and Tippett's random number tables are the most commonly used. Tippett produced 10,400 four-digit numbers by selecting 41,600 digits from census reports and grouping them into fours. An illustration of the procedure can be provided by way of an example.
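Before turning to that illustration, the draw-by-draw computation above can be checked exactly with Python's `fractions` module (a minimal sketch):

```python
from fractions import Fraction
from math import comb

# Probability that one particular 3-element sample is drawn, built up
# draw by draw (N = 6, n = 3, sampling without replacement).
p = Fraction(3, 6) * Fraction(2, 5) * Fraction(1, 4)
print(p)                              # 1/20
print(p == Fraction(1, comb(6, 3)))   # True: the same as 1/6C3
```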
To begin, we reproduce the first thirty of Tippett's numbers: 2952 6641 3992 9792 7979 5911 3170 5624 4167 9525 1545 1396 7203 5356 1300 2693 2370 7483 3408 2769 3563 6107 6913 7691 0560 5246 1112 9025 6008 8126 Suppose we wish to select a sample of 10 units from a population of 5000 units numbered from 3001 to 8000. We then pick from the random numbers above the first ten figures that are neither less than 3001 nor greater than 8000. Reading the table from left to right, beginning with the first row, we obtain: 6641, 3992, 7979, 5911, 3170, 5624, 4167, 7203, 5356, and 7483. (Any other predetermined reading order, such as right to left, would serve equally well.)
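The same selection can be reproduced in code (a sketch; the thirty numbers are the ones listed above):

```python
tippett = [2952, 6641, 3992, 9792, 7979, 5911, 3170, 5624, 4167, 9525,
           1545, 1396, 7203, 5356, 1300, 2693, 2370, 7483, 3408, 2769,
           3563, 6107, 6913, 7691, 560, 5246, 1112, 9025, 6008, 8126]

# Reading left to right, keep the first ten numbers that fall
# within the serial-number range 3001 to 8000.
sample = [x for x in tippett if 3001 <= x <= 8000][:10]
print(sample)
# [6641, 3992, 7979, 5911, 3170, 5624, 4167, 7203, 5356, 7483]
```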
The units bearing these serial numbers would then constitute the required random sample. It may be noted that drawing random samples from finite populations with the aid of random number tables is straightforward only when lists are available and the items can be readily numbered. In some situations, however, it is not possible to proceed in this way. In estimating the mean height of trees in a forest, for example, it would not be feasible to number all the trees and select a sample using random numbers. In such cases, we may select some trees in a haphazard manner, with no specific aim or purpose in mind, and treat the sample as a random sample for study purposes. 12.7 RANDOM SAMPLE FROM AN INFINITE UNIVERSE So far we have discussed random sampling on the assumption that only finite populations are involved. But what about random sampling from an infinite population? The concept of a random sample from an infinite population is harder to explain, but a few examples will show its essential characteristics. Suppose we regard the results of 20 throws of a fair die as a sample from the hypothetically infinite population of all possible throws. The sample is random if the probability of obtaining a particular number, say 1, is the same on every throw and the 20 throws are all independent of one another.
Similarly, if we sample with replacement from a finite population, we are in effect sampling from an infinite population: our sample is a random sample if, in each draw, all elements of the population have the same probability of being selected and successive draws do not depend on one another (i.e., they are independent). To put it succinctly, in a random sample from an infinite population, the selection of each item is governed by the same probabilities and successive selections are independent of one another. 12.8 COMPLEX RANDOM SAMPLING DESIGNS Probability sampling under restricted sampling techniques may result in complex random sampling designs. Such designs may also be called 'mixed sampling designs', since many of them combine probability and non-probability sampling procedures in selecting a sample. Some of the most popular complex random sampling designs are described below.
12.8.1 Systematic sampling In some cases, the most practical way of sampling is to select every ith item on a list. Sampling of this type is known as systematic sampling. A random element is introduced by using random numbers to pick the unit with which to start. For example, if a 4 per cent sample is desired, the first item is selected at random from the first twenty-five items, and thereafter every 25th item is automatically included in the sample. Thus, in systematic sampling, only the first unit is selected at random, and the remaining units of the sample are selected at fixed intervals. Although a systematic sample is not a random sample in the strict sense of the term, it is often considered reasonable to treat a systematic sample as if it were a random sample in practice. Systematic sampling has certain advantages. It can be taken as an improvement over a simple random sample in the sense that a systematic sample is spread more evenly over the entire population. It is also an easier and less costly method of sampling and can conveniently be used even with large populations. But there are certain dangers too. If there is a hidden periodicity in the population, systematic sampling will prove to be an inefficient method of sampling. For instance, suppose every 25th item produced by a manufacturing process is defective. If we were to select a 4 per cent sample of the items from this process in a systematic manner, then, depending on the random starting position, we would end up with either all defective items or all good items in our sample.
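A systematic 4 per cent sample can be sketched in a few lines (the population size of 500 is an assumption chosen for illustration):

```python
import random

N, k = 500, 25          # every 25th item gives a 4 per cent sample
random.seed(7)
start = random.randint(1, k)            # random start within the first 25
sample = list(range(start, N + 1, k))   # then every 25th serial number

print(len(sample))      # 20 items, i.e. 4 per cent of 500
```

Whatever the random starting position, the sample always contains exactly N/k items, spaced k apart.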
If all the elements of the universe are ordered in a manner representative of the total population, i.e., if the population list is in random order, systematic sampling may be regarded as equivalent to random sampling. If this is not the case, however, the results of such sampling may not be accurate or reliable. In practice, systematic sampling is used when lists of the population are available and they are of considerable length. 12.8.2 Stratified sampling The stratified sampling technique is generally used to obtain a representative sample when the population from which a sample is to be drawn does not constitute a homogeneous group. Under stratified sampling, the population is divided into several sub-populations that are individually more homogeneous than the total population (the different sub-populations are called 'strata'), and then items are selected from each stratum to constitute a sample. Since each stratum is more homogeneous than the total population, we are able to obtain more precise estimates for each stratum, and by estimating each of the component parts more accurately, we obtain a better estimate of the whole
population. In brief, stratified sampling gives more reliable and more detailed information than simple random sampling. The following three questions are of particular importance in the context of stratified sampling: (a) How should the strata be formed? (b) How should items be selected from each stratum? (c) How many items should be selected from each stratum, i.e., how should the sample size be allocated among the strata? In answer to the first question, the strata are formed on the basis of common characteristic(s) of the items to be placed in each stratum. The various strata should be formed so that elements are most homogeneous within each stratum and most heterogeneous between the different strata. Strata are thus formed purposively, usually on the basis of the researcher's past experience and personal judgement. It should be kept in mind that stratification is normally accomplished through careful consideration of the relationship between the characteristics of the population and the characteristics to be estimated. In some circumstances, a pilot study may be conducted to determine a more appropriate and efficient stratification plan: small samples of equal size are collected from each of the proposed strata, and the variances within and between the possible stratifications are examined before an appropriate plan is settled on. In answer to the second question, the method most commonly used to select items from each stratum is simple random sampling, though systematic sampling can be used if it is considered more appropriate in certain situations. The third question is usually answered by the method of proportional allocation, under which the sizes of the samples from the different strata are kept proportional to the sizes of the strata.
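The proportional-allocation rule can be sketched directly (the stratum sizes used here are the figures from this section's worked example):

```python
# Proportional allocation: n_i = n * (N_i / N).
strata = [4000, 2400, 1600]     # N1, N2, N3
N = sum(strata)                 # 8000
n = 30                          # total sample size

allocation = [n * Ni // N for Ni in strata]
print(allocation)               # [15, 9, 6]
```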
That is, if Pi represents the proportion of the population included in stratum i and n represents the total sample size, then the number of elements selected from stratum i is n · Pi. Suppose we wish to draw a sample of size n = 30 from a population of size N = 8000 which is divided into three strata of sizes N1 = 4000, N2 = 2400, and N3 = 1600. Using proportional allocation, we obtain the following sample sizes for the different strata: For the stratum with N1 = 4000, P1 = 4000/8000, and hence n1 = n · P1 = 30 × (4000/8000) = 15.
Similarly, for the stratum with N2 = 2400, n2 = n · P2 = 30 × (2400/8000) = 9, and for the stratum with N3 = 1600, n3 = n · P3 = 30 × (1600/8000) = 6. Thus, under proportional allocation, the sample sizes for the different strata are 15, 9, and 6, which are proportional to the strata sizes 4000 : 2400 : 1600. Proportional allocation is considered most efficient and an optimal design when the cost of selecting an item is equal for each stratum, there is no difference in within-stratum variances, and the purpose of sampling is to estimate the population value of some characteristic. If, however, the purpose is to compare the differences among the strata, then equal sample selection from each stratum would be more efficient, even if the strata differ in size. In cases where the strata differ not only in size but also in variability, it may be considered reasonable to take larger samples from the more variable strata and smaller samples from the less variable strata; we can then account for both differences in stratum size and differences in stratum variability by adopting a disproportionate sampling design, under which the stratum sample sizes are made proportional to both the size and the variability (standard deviation) of each stratum. 12.8.3 Cluster sampling When the total area of interest is large, a convenient way of taking a sample is to divide the area into a number of smaller non-overlapping areas and then randomly select a number of these smaller areas (commonly called clusters), with the final sample consisting of all (or samples of) units within these small areas or clusters.
In cluster sampling, the total population is divided into a number of relatively small subdivisions, each comprising still smaller units, and some of these subdivisions are then randomly selected for inclusion in the overall sample. Suppose, for example, that we want to estimate the proportion of machine parts in an inventory that are defective. Assume that there are 20,000 machine parts in the inventory, organised into 400 cases of 50
pieces each. Using cluster sampling, we would treat the 400 cases as clusters, randomly select n cases, and examine all the machine parts in each selected case. Cluster sampling undoubtedly reduces cost by concentrating surveys in selected clusters, but it is less precise than random sampling: there is less information per observation in a cluster sample than in a simple random sample of the same size. Cluster sampling is used only because of the economic advantage it provides; estimates based on cluster samples are typically more reliable per unit cost than estimates based on simple random samples. 12.8.4 Area sampling If the clusters happen to be geographical subdivisions, cluster sampling is better known as area sampling. In other words, area sampling is that form of cluster design in which the primary sampling unit is a cluster of units based on geographic location. The advantages and disadvantages of cluster sampling apply equally to area sampling. 12.8.5 Multi-stage sampling Multi-stage sampling is a further development of the principle of cluster sampling. Suppose we want to investigate the working efficiency of nationalised banks in India and we wish to take a sample of a few banks for this purpose. The first stage is to select large primary sampling units, such as states. We may then select certain districts and interview all the banks in those districts; this would represent a two-stage sampling design, with the ultimate sampling units being clusters of districts.
If, instead of taking a census of all the banks within the selected districts, we select certain towns and interview all the banks in those towns, this would represent a three-stage sampling design. And if, instead of taking a census of all the banks within the selected towns, we randomly sample banks from each selected town, the design becomes a four-stage sampling plan. If we select randomly at all stages, we have what is known as a multi-stage random sampling design. Multi-stage sampling is ordinarily used in large-scale inquiries extending over a considerable geographical area, such as an entire country. This sampling design has two advantages: a) It is easier to administer than most single-stage designs, mainly because the sampling frame under multi-stage sampling is developed in partial units. b) A large number of units can be sampled for a given cost under multi-stage sampling because of sequential clustering, whereas this is not possible in most simple designs.
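A two-stage version of the bank example might be sketched like this (the frame of districts and bank names is entirely invented for illustration):

```python
import random

random.seed(1)
# Hypothetical frame: 50 districts (primary units), each containing
# a handful of bank branches (secondary units).
frame = {f"district_{d}": [f"bank_{d}_{b}" for b in range(random.randint(5, 15))]
         for d in range(1, 51)}

# Stage 1: randomly select 5 districts.
districts = random.sample(sorted(frame), 5)
# Stage 2: interview every bank in the chosen districts (sub-sampling
# the banks instead would turn this into a three-stage design).
banks = [bank for d in districts for bank in frame[d]]

print(len(districts), "districts selected,", len(banks), "banks to interview")
```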
12.8.6 Sampling with probability proportional to size When the cluster sampling units do not have the same number (or approximately the same number) of elements, it is considered appropriate to use a random selection process in which the probability of each cluster being included in the sample is proportional to the size of the cluster. To do this, we list the number of elements in each cluster, irrespective of the method used for ordering the clusters, and then select the appropriate number of elements from the cumulative totals by systematic sampling. The numbers so chosen do not identify individual elements; rather, they indicate which clusters are to be sampled, and how many elements from each cluster are then to be selected by simple random or systematic sampling. This method is equivalent in results to a simple random sample, yet less cumbersome and relatively less expensive. 12.8.7 Sequential sampling This design is somewhat more complex than the others. Under this technique, the final size of the sample is not fixed in advance but is determined according to mathematical decision rules on the basis of information yielded as the survey progresses. This design is usually adopted in the context of acceptance sampling plans for statistical quality control. When a particular lot is to be accepted or rejected on the basis of a single sample, it is known as single sampling; when the decision is to be made on the basis of two samples, it is known as double sampling; and when the decision rests on more than two samples whose number is fixed and decided in advance, the sampling is known as multiple sampling.
But when the number of samples is more than two and is neither fixed nor decided in advance, the procedure is usually called sequential sampling. Briefly stated, under sequential sampling one goes on taking samples one after another as long as one desires to do so. 12.9 SUMMARY A sample design is a predetermined strategy for selecting a representative sample from a given population. It refers to the technique or procedure the researcher would adopt in selecting the items for the sample. In developing a sampling design, the researcher must keep the following considerations in mind: Type of universe
Sampling unit Source list Size of sample Parameters of interest Sampling procedure The sample design must result in a sample that is truly representative of the population, and it must be such that the sampling error is as small as possible. Sample designs differ on the basis of two factors, viz., the representation basis and the element selection technique. Non-probability sampling is a sampling procedure that does not provide a basis for estimating the probability that each item in the population will be included in the sample. Quota sampling: Quota sampling is a method in which interviewers are simply given quotas to fill from various strata, with some restrictions on how those quotas are to be filled. A random sample from an infinite population (e.g., 20 throws of a fair die) is one in which the probability of getting a specific number, say 1, is the same on each throw and the throws are independent of one another. 12.10 KEYWORDS Sample design: A predetermined strategy for selecting a representative sample from a given population. Non-probability sampling: A sampling procedure that does not provide a basis for estimating the probability that each item in the population will be included in the sample. Probability sampling: A sampling design that gives each and every item in the universe an equal chance of being included in the sample. Random sample: A sample in which each element is selected with the same probability and successive selections are independent of one another. Stratified sample: A sample obtained by dividing a non-homogeneous population into relatively homogeneous sub-populations (strata) and selecting items from each stratum. 12.11 LEARNING ACTIVITY 1.
A data processing analyst for a research supplier finds that preliminary computer runs of survey results show that consumers love a client’s new product. The employee buys a large block of the client’s stock. Is this ethical?
12.12 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. Define sample design.
2. How is a random sample selected?
3. List the different types of sample designs.
4. How is a random sample selected from an infinite universe?
5. Define sampling unit.

Long Questions
1. Describe the steps in sample design.
2. Identify the criteria for selecting a sampling procedure.
3. How is a random sample selected?
4. Explain the different types of complex random sampling designs.
5. Describe the different types of sample designs.

B. Multiple Choice Questions
1. The sample design may also specify the number of items that will be included in the sample, which is referred to as the ______
a. Sample number
b. Sample plan
c. Sample size
d. All of these
2. ______ may be a geographical unit such as a state, district, or village; a construction unit such as a house or flat; a social unit such as a family, club, or school; or an individual participant in the study.
a. Sampling unit
b. Sample area
c. Sample place
d. Sampling location
3. It is also referred to as a "sampling frame".
a. Sample place
b. Source list
c. Sample parameters
d. Sampling universe
4. The investigators purposefully choose the particular units of the universe that will be used to form a sample, on the assumption that the small mass they choose out of a large one will be representative of the entire universe.
a. Probability sampling
b. Non-probability sampling
c. Cluster sampling
d. Quota sampling
5. ______ is a method in which interviewers are simply given quotas to fill from various strata, with some restrictions on how those quotas are to be filled.
a. Probability sampling
b. Non-probability sampling
c. Cluster sampling
d. Quota sampling

Answers
1-c, 2-a, 3-b, 4-b, 5-d

12.13 REFERENCES

Reference books
R1. Business Research Methods – Alan Bryman & Emma Bell, Oxford University Press.
R2. Research Methodology – C.R. Kothari.
R3. Statistics for Managers Using Microsoft Excel – Levine, Stephan, Krehbiel, Berenson.

Textbook references
T1. SPSS Explained, ISBN: 9780415274104, Publisher: Tata McGraw Hill.
T2. Sancheti & Kapoor, Business Mathematics, Sultan Chand, New Delhi.
UNIT 13: DATA ANALYSIS

STRUCTURE
13.0 Learning Objectives
13.1 Meaning of Sample Design
13.2 Stages of Data Analysis
13.2.1 Editing
13.2.2 Coding
13.2.3 The Data File
13.2.4 Analysis Approach
13.3 Data Entry
13.3.1 Alternative Data Entry Formats
13.3.2 On the Horizon
13.4 Exploratory Data Analysis
13.4.1 Frequency Tables
13.4.2 Bar Charts and Pie Charts
13.4.3 Histograms
13.4.4 Stem-and-Leaf Displays
13.4.5 Pareto Diagrams
13.4.6 Boxplots
13.5 Cross-Tabulation
13.5.1 The Use of Percentages
13.5.2 Other Table-Based Analysis
13.6 Data Transformation
13.7 Statistics in Research
13.7.1 Measures of Central Tendency
13.7.2 Measures of Dispersion
13.7.3 Measures of Asymmetry (Skewness)
13.7.4 Measures of Relationship
13.7.5 Other Measures
13.8 Summary
13.9 Keywords
13.10 Learning Activity
13.11 Unit End Questions
13.12 References

13.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
Describe the stages of data analysis.
Explain cross-tabulation.
Explain how to enter data and perform the analysis.
Describe the use of statistics in research.
Explain exploratory data analysis.

13.1 MEANING OF SAMPLE DESIGN

A sample design is a definite plan for obtaining a sample from a given population. It refers to the technique or the procedure the researcher would adopt in selecting items for the sample. Sample design may as well lay down the number of items to be included in the sample, i.e., the size of the sample. Sample design is determined before data are collected. There are many sample designs from which a researcher can choose. Some designs are relatively more precise and easier to apply than others. The researcher must select or prepare a sample design that is reliable and appropriate for his research study.

13.2 STAGES OF DATA ANALYSIS

Almost all researchers will be extremely eager to begin data analysis as soon as the fieldwork is completed, and this is understandable. Only once the raw data have been transformed into intelligence is it possible to make decisions. Raw data, however, may not be in a format that is conducive to analysis. Raw data are recorded exactly as the respondent provided them: for an oral response, the raw data are the respondent's words; for a questionnaire, the raw data are the actual numbers checked. Raw data will frequently contain errors, both respondent errors and non-respondent errors. A respondent error is a mistake made by the respondent, whereas a non-respondent error is one made by an interviewer or by the person responsible for creating an electronic data file of the responses.
The data analysis process is illustrated in the diagram below. The result of the first two stages is an electronic file that can be used for data analysis. Using this file, you can run a variety of statistical routines, such as those associated with descriptive, univariate, bivariate, and multivariate analysis. Each of these approaches to data analysis will be discussed in greater detail in the following chapters. Checking for errors is a critical part of the editing, coding, and filing stages of the process. As long as errors remain in the data, the process of transforming raw data into intelligence is riskier and more difficult to complete. The first two stages of the data analysis process are editing and coding.

Figure 13.1 Overview of the Stages of Data Analysis

Data integrity refers to the notion that the data file actually contains the information that the researcher promised the decision maker he or she would obtain. Data integrity also requires that the data have been edited and properly coded so as to be useful to the decision maker. Any errors made during this process, just like errors or shortcuts taken during the interview process, have the potential to compromise the integrity of the data.
13.2.1 Editing

Fieldwork frequently yields data that contain errors. Consider the following simple questionnaire item and response:

How long have you been a resident of your current residence? 48

The researcher had intended for the response to be expressed in years. It is possible, however, that the respondent indicated the number of months, rather than years, that he or she has lived there. Or, if this was an interviewer's form, the interviewer may have marked the response in months without making this clear on the form. What should be done in this situation? Responses may also be contradictory at times. Suppose the same respondent provides this response:

What do you consider your age to be? 32 years of age

This response runs counter to the previous one. If the respondent is 32 years old, how can he or she have resided at the same address for 48 years? An adjustment should therefore be made to account for this new information. The most likely scenario is that this respondent has been a resident of the current address for four years. This example demonstrates the process of data editing. Editing is the process of checking and adjusting data to ensure that it is free of errors, consistent, and readable. As a result, the data is prepared for analysis by a computer system. It is the responsibility of an editor to check questionnaires or other data collection forms for errors and omissions. When the editor notices a discrepancy, he or she makes changes to the data in order to make the information more complete, consistent, or readable. The editor may sometimes have to reconstruct data. For example, in the case above, the researcher can reasonably infer that the respondent answered the original question in months rather than years.
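The logic of this edit can be sketched as a simple automated consistency check. This is an illustrative sketch only, written here in Python; the function name and the months-versus-years rule are assumptions for the example, not part of any survey package.

```python
# Hypothetical edit check: if the reported residence duration exceeds
# the respondent's age, assume the answer was given in months and
# convert it to years; otherwise leave it unchanged.

def edit_residence(age_years, residence_value):
    """Return a plausible residence duration in years."""
    if residence_value > age_years:
        return residence_value / 12  # reinterpret as months
    return residence_value

# The example from the text: age 32, reported duration 48
print(edit_residence(32, 48))  # 4.0 -> probably four years, answered in months
```

A 55-year-old reporting 48 would pass through unchanged, matching the caution discussed next: without corroborating information, the editor should not alter a plausible response.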
As a result, it is possible to reconstruct the probable correct answer. Had the respondent's age been 55 years, however, it would not have been advisable to change the response unless other information was available. Perhaps the respondent had been a resident of the house since childhood? That possibility appears real enough to prevent the response from being changed.

13.2.2 Coding

Editing can be distinguished from coding, which is the process of assigning numerical scores or classifying symbols to data after they have been edited. Meticulous editing makes the coding job easier. Codes are intended to represent the meaning contained within the data set. The use of numerical symbols in questionnaires and interview forms allows information to be transferred from one device to another. Codes are frequently, but
not always, composed of numerical symbols. They are, however, more broadly defined as rules for interpreting, classifying, and recording data. When it comes to qualitative research, numbers are rarely used as codes.

Coding: the process of assigning a numerical score or other character symbol to previously edited data.
Codes: rules for interpreting, classifying, and recording data in the coding process; also, the actual numerical or other character symbols assigned to raw data.

Codebook Construction
A codebook, also known as a coding scheme, contains information about each variable in the study and specifies how coding rules should be applied to each variable. Used by the researcher or research staff, it makes data entry and data analysis more accurate and efficient. During analysis, it is also the most reliable source for determining the locations of variables within a data file. In statistical programmes, the coding scheme is often a crucial component of the data file.

Coding Rules
Pre- and post-coding and categorisation of a data set are guided by four rules. A single variable's categories should be:
• Appropriate to the research problem and purpose.
• Exhaustive.
• Mutually exclusive.
• Derived from one classification dimension.
Researchers take these considerations into account when developing or selecting each specific measurement question. One of the goals of pilot testing any measurement instrument is to identify and anticipate categorisation problems before they occur.

13.2.3 The Data File
Structured qualitative responses are coded and then stored in an electronic data file. This file is likely to hold both the qualitative and the quantitative responses of each respondent involved in a survey or interview. A set of terms can be used to describe this process and the file that is produced as a result.
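A minimal codebook can be sketched as a mapping from each variable to its permitted codes, together with a check that recorded values fall within the codebook. The variable names and codes below are invented for illustration; they are not drawn from any actual study.

```python
# A minimal codebook sketch: each variable lists its allowed codes and
# their meanings. Hypothetical variables and codes for illustration.

CODEBOOK = {
    "gender": {1: "male", 2: "female", 9: "no answer"},
    "degree": {0: "no degree", 1: "MBA"},
}

def validate(record):
    """Return a list of (variable, value) pairs that violate the codebook."""
    errors = []
    for var, value in record.items():
        if var in CODEBOOK and value not in CODEBOOK[var]:
            errors.append((var, value))
    return errors

print(validate({"gender": 2, "degree": 3}))  # [('degree', 3)]
```

A check like this enforces the exhaustiveness rule mechanically: any value that does not belong to one of a variable's defined categories is flagged for the editor.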
Some of the terminology used these days seems a little strange. What, for example, does a "card" have to do with a simple computer file? Most of the terminology used to describe files can be traced back to the early days of computing. Back then, data and
the computer programmes that generated the results were stored on physical computer cards, not on hard drives. Readers will, one hopes, no longer be required to store data on physical cards; there are much more convenient and cost-effective alternatives. Researchers organise coded data into cards, fields, records, and files. A file is a collection of records that are kept together. A field is an array of characters (each character representing a single number, letter, or special symbol such as a question mark) that represents a single piece of data, usually a variable. Some variables, particularly text data, will require a large field; other variables might only require a field of one character. To represent text variables, we use string characters, which in computer terminology refers to a sequence of alphabetic (i.e., non-numeric) characters that can be combined to form a word. String-character fields of eight or more characters are known as long fields. A dummy variable, on the other hand, is a numeric variable whose field requires only one character. A record is a collection of fields that are related to one another; a record originally represented a single, complete computer card. When referring to the data of a single respondent, researchers may use the term record. A data file is a collection of related records that together form a data set. Value labels are extremely useful because they allow a word or a short phrase to be associated with a numerical code. In the example that follows, the value label indicates whether or not someone has an MBA degree.
It is necessary to match the labels to the numeric code in the following manner:
• If dummy = 0, the value label is “no degree”
• If dummy = 1, the value label is “MBA”
The coder's use of value labelling will no doubt be appreciated by the analysts. When frequencies or other statistical output is generated for this variable, the value label will appear in the output instead of just a number. This has the advantage of not requiring the analyst to recall which coding scheme was employed; in other words, he or she won't have to remember that a “1” denoted a Master of Business Administration. Other statistical programmes accommodate value labels in a similar manner. Using the SAS programming language, the coder could create the following format statement:

proc format;
value labels 0 = ‘none’ 1 = ‘mba’;
data chap19;
input dummy perf sales;
format dummy labels.;

This sequence reads three variables: dummy, perf (performance), and sales. Just as in the SPSS example, it assigns the label “none” to a value of 0 for dummy and the label “mba” to a value of 1 for dummy.

The Data File
Data is typically stored in a matrix format similar to that of a standard spreadsheet file. A data file stores the information from a research project, typically as a rectangular arrangement (matrix) of data in rows and columns.

13.2.4 Analysis Approach
Lastly, error checking and verification (also known as data cleaning) are performed to ensure that all codes are valid and accurate, after which the analysis can proceed.

13.3 DATA ENTRY

Data entry is the process of converting information gathered through secondary or primary methods into a format that can be viewed and manipulated. Keyboarding remains the primary tool for researchers who need to create a data file quickly and store it in a small amount of space on a variety of media. Researchers have, however, benefited from more efficient methods of expediting the research process, particularly bar coding and optical character and mark recognition.

13.3.1 Alternative Data Entry Formats

Keyboarding
A full-screen editor, which allows you to edit or browse an entire data file at once, is a viable method of data entry for statistical packages such as SPSS or SAS. SPSS offers several data entry products, including Data Entry Builder, which allows for the creation of forms and surveys, and Data Entry Station, which allows for the entry of data into a database.

Database Development
Database programmes are extremely useful when working on large projects because they allow for easy data entry. A database is a collection of data that has been organised to allow for computerised retrieval of the information.
Programs enable users to define data fields and link files, which simplifies the process of storing, retrieving, and updating information. Relationships exist among data fields, data records, files, and databases. The orders placed by a company serve as an example of a database.
In some cases, ordering information is stored in multiple files, including salesperson's customer files, customer financial records, order manufacturing records, and order shipping documentation. The information is organised so that only those who are authorised can see the portions relevant to their needs. If the files are linked together, a customer can change his or her shipping address once and the change is reflected in all of the relevant files, saving time and effort. Another method of entering data into a database is through e-mail data capture, which has become increasingly popular among those who conduct surveys via e-mail. If the e-mail address of a specific respondent is known, the survey can be delivered to that respondent via e-mail. Questions are completed on a computer screen, returned via e-mail, and the input is entered into a database. An intranet can also be used to collect information. When participants connected by a network take part in an online survey by filling out a database form, the information is stored in a database on a network server for later or real-time analysis. Unwanted participants can be prevented from skewing the results of an online survey by requiring them to enter an ID and password.

Spreadsheets
Spreadsheets are a specialized type of database for data that need organizing, tabulating, and simple statistics. They also offer some database management, graphics, and presentation capabilities. Data entry on a spreadsheet uses numbered rows and lettered columns with a matrix of thousands of cells into which an entry may be placed. Spreadsheets allow you to type numbers, formulas, and text into appropriate cells.

Voice Recognition
The rise in the use of computerised random dialling has spurred the development of new methods of data collection.
For telephone interviewers, voice recognition and voice response systems provide some interesting alternatives to traditional methods of interviewing. In response to a voice response received from a randomly selected number, the computer enters a questionnaire routine. These systems are improving at a rapid pace and will soon be able to convert recorded voice responses into digital data files. Another capability made possible by computers connected to telephone lines is the digital telephone keypad response, which is frequently used by restaurants and entertainment venues to evaluate and improve customer service. A participant who has been invited to participate answers questions by pressing the appropriate number on the telephone keypad (touch-tone). The computer captures the data by decoding the electrical signal produced by the tone and storing the numeric or alphabetic response in a data file. Although not originally intended for use in survey data collection, software components within Microsoft Windows 7 include advanced speech recognition functionality, allowing people to enter and edit data simply by speaking into a microphone. Instead of clipboards and pencils, field interviewers can use mobile computers or notebooks to conduct their
interviews. With a built-in communications modem, wireless LAN (local area network), or cellular link, their files can be sent directly to another computer in the field or to a remote site (the cloud) without the need for a separate network connection. Using this method, supervisors can inspect data immediately, and data processing at a centralised facility is simplified. Exactly this type of technology is used by Nielsen Media in its portable People Meter device.

Bar Codes
After being introduced as a technological curiosity in 1973, the bar code has evolved into a business mainstay. Bar codes became widely used in the grocery industry after a McKinsey & Company study, and the Kroger grocery chain pilot-tested a production system as a result. The use of bar-code technology simplifies the interviewer's role as a data recorder, which saves time. As soon as an interviewer passes a bar-code wand over the appropriate codes, the data is recorded in a small, lightweight unit for later translation. The Census Data Capture Center used bar codes to identify residents in the large-scale processing project Census 2000. Researchers conducting research on magazine readership can scan bar codes to identify a magazine cover recognised by an interview participant and record the results. The bar code has become widely adopted in a variety of applications, such as point-of-sale terminals, hospital patient ID bracelets, inventory control, product and brand tracking (for promotional technique evaluation), shipment tracking, timing of marathon runners, rental car returns (to speed up the return of cars and generate invoices), and tracking the mating habits of insects. The military labels boats in storage with bar codes that are two feet long.
The codes can be found on a variety of business documents, truck parts, and lumberyard logs. Codabar is the barcode used on Federal Express shipping labels. Other codes, which include both letters and numbers, have the potential to be useful to researchers.

13.3.2 On the Horizon
Even as the time between data collection and analysis shrinks, innovative approaches hold great promise. In recent years, the ability to integrate visual images, live streaming video, audio, and data has supplanted video equipment as the preferred method for recording an experiment, interview, or focus group discussion. Data can be extracted from the responses for analysis, while the audio and visual images are preserved for later evaluation. Although technology will never completely replace researcher judgement, it can help to reduce data-handling errors, shorten the time between data collection and analysis, and provide more usable information by streamlining data collection and analysis.
13.4 EXPLORATORY DATA ANALYSIS

Exploratory data analysis is both a data analysis perspective and a set of techniques. It is the first step in the search for evidence; without it, confirmatory data analysis has nothing to evaluate.

13.4.1 Frequency Tables
First and foremost, when considering the most effective way to display data, it is important to consider whether a graphical method is required at all. The saying "A picture is worth 1,000 words" holds true in some cases, but in others, frequency tables convey information more effectively than graphs. This is especially true when it is more important to look at the actual values of the numbers in different categories than at the overall pattern among categories. A frequency table is a useful tool for presenting large amounts of data because it is a good compromise between text (paragraphs describing the data values) and pure graphics (such as a histogram). Consider the case of a university that is interested in gathering information about the general health of its incoming freshman classes. Obesity is a growing public health concern in the United States, so one of the statistics collected is the Body Mass Index (BMI), which is calculated by dividing the weight in kilogrammes by the squared height in metres. The BMI is not a foolproof measure of health: athletes, for example, often measure as underweight (distance runners, gymnasts) or as overweight or obese (football players, weight throwers). It is, however, a simple measurement that is a reliable indicator of a healthy or unhealthy body weight for a large proportion of the population. Although the BMI is a continuous measure, it is frequently interpreted in terms of categories, with commonly accepted ranges as the starting point.
As shown in the table below, the BMI ranges established by the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO) are generally considered to be useful and valid.

BMI range        Category
< 18.5           Underweight
18.5–24.9        Normal weight
25.0–29.9        Overweight
30.0 and above   Obese

Table 13.1 CDC/WHO categories for BMI
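The BMI formula and the CDC/WHO classification in Table 13.1 can be expressed directly. This is a sketch for illustration; the function names are our own, not from any statistical package.

```python
# BMI = weight (kg) / height (m)^2, classified with the CDC/WHO
# cut-offs from Table 13.1.

def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def bmi_category(value):
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"

v = bmi(70, 1.75)  # a 70 kg person who is 1.75 m tall
print(round(v, 1), bmi_category(v))  # 22.9 Normal weight
```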
Now consider Table 13.2, an entirely fictitious list of BMI classifications for entering freshmen.

BMI range        Number
< 18.5           25
18.5–24.9        500
25.0–29.9        175
30.0 and above   50

Table 13.2 Distribution of BMI in the freshman class of 2005

This simple table tells us at a glance that most of the freshmen are of normal body weight or are moderately overweight, with a few who are underweight or obese. Note that this table presents raw numbers or counts for each category, which are sometimes referred to as absolute frequencies; these numbers tell you how often each value appears, which can be useful if you are interested in, for instance, how many students might require obesity counselling. However, absolute frequencies don’t place the number of cases in each category into any kind of context. We can make this table more useful by adding a column for relative frequency, which displays the percentage of the total represented by each category. The relative frequency is calculated by dividing the number of cases in each category by the total number of cases (750) and multiplying by 100. Table 13.3 shows both the absolute and the relative frequencies for this data.

BMI range        Number   Relative frequency
< 18.5           25       3.30%
18.5–24.9        500      66.70%
25.0–29.9        175      23.30%
30.0 and above   50       6.70%

Table 13.3 Absolute and relative frequency of BMI categories for the freshman class of 2005
Note that relative frequencies should add up to approximately 100%, although the total might be slightly higher or lower due to rounding error. We can also add a column for cumulative frequency, which shows the relative frequency for each category and those below it, as in Table 13.4. The cumulative frequency for the final category should always be 100% except for rounding error.

BMI range        Number   Relative frequency   Cumulative frequency
< 18.5           25       3.30%                3.30%
18.5–24.9        500      66.70%               70.00%
25.0–29.9        175      23.30%               93.30%
30.0 and above   50       6.70%                100%

Table 13.4 Cumulative frequency of BMI in the freshman class of 2005

Cumulative frequency tells us at a glance, for instance, that 70% of the entering class is normal weight or underweight. This is particularly useful in tables with many categories because it allows the reader to ascertain specific points in the distribution quickly, such as the lowest 10%, the median (50% of the cumulative frequency), or the top 5%. You can also construct frequency tables to make comparisons between groups. You might be interested, for instance, in comparing the distribution of BMI in male and female freshmen, or for the class that entered in 2005 versus the entering classes of 2000 and 1995. When making comparisons of this type, raw numbers are less useful (because the size of the classes can differ) and relative and cumulative frequencies more useful. Another possibility is to create graphic presentations such as the charts described in the next section, which can make such comparisons clearer.

13.4.2 Bar Charts and Pie Charts
The bar chart is particularly well suited for displaying discrete data with only a few categories, such as the BMI of the freshman class in our illustration.
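The arithmetic behind Tables 13.3 and 13.4 can be recomputed from the raw counts in a few lines; this sketch, in Python for illustration, builds the relative and cumulative frequency columns.

```python
# Relative frequency = count / total * 100; cumulative frequency is a
# running sum of the relative frequencies (counts from Table 13.2).

counts = {"< 18.5": 25, "18.5-24.9": 500, "25.0-29.9": 175,
          "30.0 and above": 50}
total = sum(counts.values())  # 750

rows, cumulative = [], 0.0
for category, n in counts.items():
    relative = 100 * n / total
    cumulative += relative
    rows.append((category, n, round(relative, 1), round(cumulative, 1)))
    print(f"{category:<15}{n:>5}{relative:8.1f}%{cumulative:8.1f}%")
```

The final cumulative value comes out to 100.0, and the second row reproduces the 70% "normal weight or below" figure discussed above.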
The bars in a bar chart are typically spaced apart from one another so that they do not imply continuity; although our categories are based on categorising a continuous variable in this case, they could equally well be completely nominal categories such as favourite sport or major field of study. Figure 13.1 shows the freshman BMI information presented in a bar chart. (Unless otherwise noted, the charts presented in this chapter were created using Microsoft Excel.)
Figure 13.1 Bar Charts

The concept of relative frequencies becomes even more useful if we compare the distribution of BMI categories over several years. Consider the fictitious frequency information in Table 13.5.

BMI range                   1995           2000           2005
Underweight (< 18.5)        50    8.90%    45    6.80%    25    3.30%
Normal (18.5–24.9)          400   71.40%   450   67.70%   500   66.70%
Overweight (25.0–29.9)      100   17.90%   130   19.50%   175   23.30%
Obese (30.0 and above)      10    1.80%    40    6.00%    50    6.70%
Total                       560   100.00%  665   100.00%  750   100.00%

Table 13.5 Absolute and relative frequencies of BMI for three entering classes

Because the class size varies from year to year, relative frequencies (percentages) are the most useful for observing trends in the distribution of weight categories. In this case, there has been a marked decrease in the proportion of underweight students and an increase in the proportion of overweight and obese students.
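Because the three classes differ in size, comparing them requires converting counts to percentages per year. The sketch below recomputes the relative frequencies of Table 13.5 from the raw counts; the dictionary layout is our own choice for the example.

```python
# Per-year relative frequencies: each year's counts divided by that
# year's class size (560, 665, 750), expressed as percentages.

classes = {
    1995: {"Underweight": 50, "Normal": 400, "Overweight": 100, "Obese": 10},
    2000: {"Underweight": 45, "Normal": 450, "Overweight": 130, "Obese": 40},
    2005: {"Underweight": 25, "Normal": 500, "Overweight": 175, "Obese": 50},
}

results = {}
for year, counts in classes.items():
    total = sum(counts.values())
    results[year] = {cat: round(100 * n / total, 1)
                     for cat, n in counts.items()}
    print(year, total, results[year])
```

The output reproduces the percentages in Table 13.5 and makes the trend visible: the underweight share falls from 8.9% to 3.3% while the obese share rises from 1.8% to 6.7%.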
This information can also be displayed using a bar chart, as shown in Figure 13.2. An analysis of the data shows that, over the course of ten years, there has been a small but discernible trend toward fewer underweight and normal-weight students and more overweight and obese students (reflecting changes in the American population at large). Take note that creating a chart is not the same thing as carrying out an analysis of the data; we cannot tell from this chart alone whether the differences are statistically significant.

Figure 13.2 Bar chart of BMI distribution in three entering classes

Pie Charts
The well-known pie chart presents data in a manner similar to that of the stacked bar chart: it graphically depicts the proportion of the whole that each part occupies. Pie charts, like stacked bar charts, are most useful when there are only a few categories of information to display and the differences between those categories are quite significant. Although pie charts are still widely used in some fields, in others they have been aggressively decried as uninformative at best and potentially misleading at worst. The data are presented in pie chart form in Figure 13.3; you must decide whether this is a useful way to present the information based on the context and conventions in which you find yourself. Keep in mind that this is a single pie chart with data from a single year; there are a variety of other options available, such as side-by-side charts (which allow for easier comparison of the proportions of different groups) and exploded sections (to show a more detailed breakdown of categories within a segment).
Figure 13.3 Pie chart showing BMI distribution for freshmen entering in 2005

13.4.3 Histograms
In addition to the bar chart, the histogram is a popular choice for displaying continuous data. A histogram is similar in appearance to a bar chart; however, in a histogram the bars (also known as bins, because you can think of them as bins into which values from a continuous distribution are sorted) touch each other, whereas the bars in a bar chart do not. Histograms also typically have a greater number of bars than bar charts. The bars in a histogram do not necessarily have to be the same width, although this is frequently the case. Instead of simply a series of labels, the x-axis (horizontal axis) in a histogram represents a scale, and the area of each bar represents the proportion of values that are contained within a given range. Note that the shape of this histogram (Figure 13.4) is very similar to the shape of the stem-and-leaf plot of the same data, rotated 90 degrees.
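The binning idea can be sketched without a graphics package: the 26 exam grades used in the stem-and-leaf example can be grouped into bins of width 10 and drawn as a text histogram. The '#'-bar output is a stand-in for software such as Excel or a plotting library.

```python
# Equal-width histogram bins (width 10): each value is assigned to the
# bin [60-69], [70-79], ..., and one '#' is printed per observation.

values = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80, 83,
          84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]

bins = {}
for v in values:
    low = (v // 10) * 10  # 61 -> 60, 100 -> 100
    bins[low] = bins.get(low, 0) + 1

for low in sorted(bins):
    print(f"{low:>3}  {'#' * bins[low]}")
```

Rotated 90 degrees, the '#' bars take the same shape as the stem-and-leaf display of these grades.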
Figure 13.4 Histogram with a bin width of 10

13.4.4 Stem-and-Leaf Displays

The types of charts discussed so far are most appropriate for displaying categorical data. Continuous data have their own set of graphic display methods. One of the simplest ways to display continuous data graphically is the stem-and-leaf plot, which can easily be created by hand and presents a quick snapshot of a data distribution. To make a stem-and-leaf plot, divide your data into intervals (using your common sense and the level of detail appropriate to your purpose) and display each data point using two columns. The stem is the leftmost column and contains one value per row; the leaf is the rightmost column and contains one digit for each case belonging to that row. The result is a plot that displays the actual values of the data set while also assuming a shape that indicates which ranges of values are most common. The numbers can represent multiples of other numbers (for instance, units of 10,000 or of 0.01) if appropriate to the data values in question. Here is a simple example. Suppose we have the final exam grades for 26 students and want to present them graphically. These are the grades:

61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80, 83, 84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100

The logical division is units of 10 points (60-69, 70-79, and so on), so we construct the stem from the digits 6, 7, 8, 9 (the tens place, for those of you who remember your grade-school math) and create the leaf for each number from the digit in the ones place, ordered left to right from smallest to largest. Figure 13.5 shows the final plot.
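The construction just described is mechanical enough to sketch in code: split each grade into its tens part (the stem) and ones digit (the leaf), then print the leaves for each stem in ascending order.

```python
# Build the stem-and-leaf display for the 26 exam grades from the text.
grades = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80, 83,
          84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]

plot = {}  # stem (tens place and above) -> ordered list of leaves (ones digits)
for g in sorted(grades):
    stem, leaf = divmod(g, 10)  # e.g. 87 -> stem 8, leaf 7
    plot.setdefault(stem, []).append(leaf)

for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Note that the grade of 100 gets its own stem of 10, and that repeated values (such as the three 89s) each contribute a separate leaf, which is what gives the plot its histogram-like shape.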
Figure 13.5 Stem-and-leaf plot of final exam grades

13.4.5 Pareto Diagrams

The Pareto chart, also known as a Pareto diagram, combines the characteristics of a bar chart and a line chart: the bars display frequency and relative frequency, while the line displays cumulative frequency. The major advantage of a Pareto chart is that it makes it easy to see which factors are most important in a situation and, therefore, which factors should receive the most attention. For example, Pareto charts are used in industrial settings to identify the factors responsible for the preponderance of delays or defects in a manufacturing process. In a Pareto chart, the bars are ordered in descending frequency from left to right (so that the most common cause is furthest to one side and the least common cause furthest to the other), and a cumulative frequency line is superimposed over the bars (so you can see, for instance, how many factors are involved in 80 percent of production delays). Table 13.6 shows a hypothetical data set: the number of defects traceable to different aspects of the manufacturing process in an automobile factory.

Department      Number of defects
Accessory       350
Body            500
Electrical      120
Engine          150
Transmission     80
Table 13.6 Manufacturing defects by department

Although we can see that the Accessory and Body departments are responsible for the greatest number of defects, it is not immediately obvious what proportion of defects can be traced to them. Figure 13.6, which displays the same information in a Pareto chart (produced using SPSS), makes this clearer.

Figure 13.6 Major causes of manufacturing defects

13.4.6 Boxplots

Developed by the statistician John Tukey, the boxplot, also known as the hinge plot or the box-and-whiskers plot, is a convenient way to summarise and display the distribution of a set of continuous data. Although boxplots can be drawn by hand (as can many other graphics, such as bar charts and histograms), in practice they are almost always created with software. The exact methods of construction vary from one software package to another, but all boxplots are designed to highlight five important characteristics of a data set: the median, the first and third quartiles (and thus the interquartile range as well), and the minimum and maximum values. The central tendency, range, symmetry, and presence of outliers in a data set are visible at a glance from a boxplot, and side-by-side boxplots make it easy to compare different distributions of data. Figure 13.7 shows a boxplot of the final exam grades used in the stem-and-leaf plot earlier in this section.
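The five numbers a boxplot highlights can be computed directly from the data. The sketch below uses Python's standard `statistics` module; note that quartile conventions differ among software packages, so the first and third quartiles computed here may not exactly match the values a particular program (such as SPSS) reports for the same data.

```python
# Five-number summary of the exam grades: minimum, first quartile, median,
# third quartile, maximum. Quartile values depend on the convention used;
# statistics.quantiles defaults to the "exclusive" (n+1)-based method.
import statistics

grades = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80, 83,
          84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]

q1, median, q3 = statistics.quantiles(grades, n=4)

summary = {
    "min": min(grades),        # bottom whisker endpoint
    "q1": q1,                  # lower hinge of the box
    "median": median,          # dark line inside the box
    "q3": q3,                  # upper hinge of the box
    "max": max(grades),        # top whisker endpoint
    "iqr": q3 - q1,            # interquartile range: the height of the box
}
print(summary)
```

Comparing the median's position inside the box, and the box's position inside the whiskers, is how symmetry is judged at a glance from the finished plot.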
Figure 13.7 Boxplot of exam data (created in SPSS)

In this case, the dark line represents the median, 81.5. The shaded box encloses the interquartile range, with the first quartile (25th percentile), 72.5, as its lower boundary and the third quartile (75th percentile), 87.75, as its upper boundary. Tukey referred to these quartiles as hinges, which is how the hinge plot got its name. The short horizontal lines at 61 and 100 represent the minimum and maximum values, respectively; together with the lines connecting them to the interquartile-range box, they are referred to as whiskers, which is how the box-and-whiskers plot got its name. This data set appears symmetrical at a glance because the median is approximately centered within the interquartile range, and the interquartile range is approximately centered within the entire range of the data.

13.5 CROSS-TABULATION

What is cross-tabulation, and how does it work? Have you ever taken a closer look at the nutrition facts on the back of a snack pack? That small table shows how each snack will affect your overall energy intake, helping you make informed decisions about your diet and calorie consumption. Cross-tabulation follows a similar pattern: it helps you make informed decisions about your research by identifying patterns, trends, and relationships between the parameters of your study. When conducting a research study, dealing with raw data can be a daunting task, because raw data always point to a number of chaotic possibilities. In such a situation, cross-tabulation can help you narrow those possibilities down to a defensible conclusion by identifying trends, comparisons, and correlations between factors within your study.

Consider putting together a college application. You were probably not aware of it at the time, but you were mentally comparing and contrasting various factors in order to decide which colleges you wanted to attend and which ones you had the best chance of getting into. That decision-making process weighed one factor at a time, just as cross-tabulation does.

13.5.1 The Use of Percentages

Percentages serve two functions in data presentation. First, they simplify the data by reducing all numbers to a range between 0 and 100. Second, they translate the data into a standard form, with a base of 100, so that different distributions can be compared. In a sampling situation, the number of cases falling into a category is meaningless unless it is related to some base: a count of 28 overseas assignees means little unless we know that they were drawn from a sample of 100 people, in which case we can conclude that 28 percent of the sample is on an international assignment. This information is even more useful when the research problem requires a comparison of several data distributions. Suppose, for example, that the previously reported data were collected five years ago, and that the current study has a sample of 1,500 participants, of whom 360 (24 percent) were selected for overseas assignments. Expressing the data as percentages makes the relative relationships and shifts immediately visible.

Figure 13.7 Comparison of Percentages in Cross-Tabulation Studies by Overseas Assignment

13.5.2 Other Table-Based Analysis

The discovery of a significant association between variables usually indicates that more research is needed.
Even if a statistically significant association is discovered, the questions of why and under what circumstances remain unanswered. It is frequently necessary to introduce a control variable to interpret the relationship, and cross-tabulation tables provide the framework for doing so. Statistical applications such as Minitab, SAS, and SPSS offer many options for constructing n-way tables with multiple control variables. Suppose you want to cross-tabulate two variables with one control. The number of tables is determined by the control variable: if it has five values, five tables are produced, regardless of the number of values in the major variables. For some applications, five separate tables are appropriate; for others, contiguous tables, or a single table containing the values of all the variables, may be preferred. The latter type of report is the management report. The figure below illustrates how all three variables can be presented together. Such software can also handle considerably more sophisticated tables and statistics. Automatic interaction detection (AID) is a more advanced extension of n-way tables. AID is a computerised statistical procedure that requires the researcher to select a dependent variable and a set of predictors, or independent variables. From up to 300 predictor variables, the program then searches for the best single split of the data on each predictor, chooses one, and splits the sample, using a statistical test to evaluate the appropriateness of each split.

Figure 13.8 Automatic Interaction Detection Example (MindWriter's Repair Satisfaction)
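The idea of a two-way table with a control variable can be sketched without specialised software: count the (row, column) pairs separately for each value of the control, producing one two-way table per control value. The variables and records below are hypothetical, chosen only to echo the overseas-assignment example above.

```python
# A minimal sketch of cross-tabulation with one control variable.
# The variable names and records here are hypothetical, for illustration only.
from collections import defaultdict

# Each record: (overseas_assignment, gender, region) -- region is the control.
records = [
    ("yes", "male", "north"), ("no", "female", "north"),
    ("yes", "female", "north"), ("no", "male", "south"),
    ("no", "female", "south"), ("yes", "male", "south"),
    ("no", "male", "south"),
]

# tables[control][(row, column)] -> count: one two-way table per control value.
tables = defaultdict(lambda: defaultdict(int))
for assignment, gender, region in records:
    tables[region][(assignment, gender)] += 1

for region in sorted(tables):
    print(f"Control: region = {region}")
    for (assignment, gender), count in sorted(tables[region].items()):
        print(f"  assignment={assignment:>3}  gender={gender:<6}  n={count}")
```

A package like SPSS or SAS automates exactly this counting for n-way tables; with a five-valued control variable, the loop above would simply produce five such two-way tables.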