Browsing by Subject "Biostatistics"

Evaluation of alternative statistical methods for genomic selection for quantitative traits in hybrid maize
(2012) Schulz-Streeck, Torben; Piepho, Hans-Peter
The efficacy of several competing approaches to genomic selection (GS) was tested using different simulated and empirical maize breeding datasets. Here, GS is viewed as a general approach, incorporating all the different stages from the phenotypic analysis of the raw data to the marker-based prediction of the breeding values. The overall goal of this study was to develop and comparatively evaluate different approaches for accurately predicting genomic breeding values in GS. In particular, the specific objectives were to: (1) Develop different approaches for using information from analyses preceding the marker-based prediction of breeding values for GS. (2) Extend and/or suggest efficient implementations of statistical methods used at the marker-based prediction stage of GS, with a special focus on improving the predictive accuracy of GS in maize breeding. (3) Compare different approaches to reliably evaluate and compare methods for GS. An important step in the analyses preceding the marker-based prediction is the phenotypic analysis stage. One way of combining phenotypic analysis and marker-based prediction into a single-stage analysis is presented. However, a stagewise analysis is typically computationally more efficient than a single-stage analysis. Several different weighting schemes for minimizing information loss in stagewise analyses are therefore proposed and explored. It is demonstrated that orthogonalizing the adjusted means before submitting them to the next stage is the most efficient way within the set of weighting schemes considered. Furthermore, when using stagewise approaches, it may suffice to omit the marker information until the very last stage, if the marker-by-environment interaction has only a minor influence, as was found to be the case for the datasets considered in this thesis. It is also important to ensure that genotypic and phenotypic data for GS are of sufficiently high quality.
This can be achieved by using appropriate field trial designs and carrying out adequate quality controls to detect and eliminate observations deemed to be outlying based on various diagnostic tools. Moreover, it is shown that pre-selection of markers is less likely to be of high practical relevance to GS in most cases. Furthermore, the use of semivariograms to select models with the greatest strength of support in the data for GS is proposed and explored. It is shown that several different theoretical semivariogram models were all well supported by an example dataset and no single model was selected as being clearly the best. Several methods and extensions of GS methods have been proposed for marker-based prediction in GS. Their predictive accuracies were similar to that of the widely used ridge regression best linear unbiased prediction method (RR-BLUP). It is thus concluded that RR-BLUP, spatial methods, machine learning methods, such as componentwise boosting, and regularized regression methods, such as elastic net and ridge regression, have comparable performance and can therefore all be routinely used for GS for quantitative traits in maize breeding. Accounting for environment-specific or population-specific marker effects had only a minor influence on predictive accuracy, contrary to the findings of several other studies. However, accuracy varied markedly among populations, with some populations showing surprisingly low levels of accuracy. Combining different populations prior to marker-based prediction improved prediction accuracy compared to doing separate population-specific analyses. Moreover, polygenic effects can be added to the RR-BLUP model to capture genetic variance not captured by the markers. However, doing so yielded only minor improvements, especially for high marker densities.
To relax the assumption of homogeneous variance of markers, the RR-BLUP method was extended to accommodate heterogeneous marker variances, but this had negligible influence on the predictive accuracy of GS for a simulated dataset. The widely used information-theoretic model selection criterion, namely the Akaike information criterion (AIC), ranked models in terms of their predictive accuracies similarly to cross-validation in the majority of cases. However, further tests would be required to definitively determine whether the computationally more demanding cross-validation may be substituted with the more efficient model selection criteria, such as AIC, without much loss of accuracy. Overall, a stagewise analysis, in which the markers are omitted until the very last stage, is recommended for GS for the tested datasets. The particular method used for marker-based prediction from the set of those currently in use is of minor importance. Hence, the widely used and thoroughly tested RR-BLUP method would seem adequate for GS for most practical purposes, because it is easy to implement using widely available software packages for mixed models and it is computationally efficient.
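The RR-BLUP method named in the abstract above can be illustrated with a minimal ridge regression sketch. This is not the thesis's implementation: the function names, the toy marker coding (-1/+1) and the fixed `shrinkage` value are assumptions made for illustration. In RR-BLUP the shrinkage parameter corresponds to the ratio of residual variance to marker-effect variance, which is normally estimated (for example by REML in mixed-model software); here it is simply supplied by hand, and the intercept is handled by centring the phenotypes.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def ridge_marker_effects(markers, phenotypes, shrinkage):
    """Return (mean, marker effects) from (Z'Z + shrinkage * I) u = Z'(y - mean)."""
    n_markers = len(markers[0])
    mean_y = sum(phenotypes) / len(phenotypes)
    centred = [y - mean_y for y in phenotypes]
    ztz = [[sum(row[i] * row[j] for row in markers) for j in range(n_markers)]
           for i in range(n_markers)]
    for i in range(n_markers):
        ztz[i][i] += shrinkage  # the ridge penalty shrinks all effects equally
    zty = [sum(row[i] * c for row, c in zip(markers, centred))
           for i in range(n_markers)]
    return mean_y, solve_linear_system(ztz, zty)


def predict(markers, mean_y, effects):
    """Predict genotypic values as mean plus the sum of marker effects."""
    return [mean_y + sum(z * u for z, u in zip(row, effects)) for row in markers]


# Hypothetical toy data: four lines, two markers, no residual noise.
toy_markers = [[1, -1], [-1, 1], [1, 1], [-1, -1]]
toy_phenotypes = [11.0, 9.0, 13.0, 7.0]  # generated as 10 + 2*z1 + 1*z2

mean_y, effects = ridge_marker_effects(toy_markers, toy_phenotypes, shrinkage=4.0)
print(effects)
print(predict([[1, 1]], mean_y, effects))
```

With `shrinkage=0.0` the toy data return the generating effects 2 and 1; with `shrinkage=4.0` both effects are halved, which shows the homogeneous shrinkage that the heterogeneous-variance extension mentioned in the abstract is meant to relax.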
Statistical methods for analysis of multienvironment trials in plant breeding: accuracy and precision
(2021) Buntaran, Harimurti; Piepho, Hans-Peter
Multienvironment trials (MET) are carried out every year in different environmental conditions to evaluate a large number of cultivars for traits such as yield, because different cultivars perform differently in various environmental conditions, a phenomenon known as genotype×environment interaction. MET aim to provide accurate information on cultivar performance so that a recommendation of which cultivar performs best under a grower's field conditions can be made available. MET data is often analysed via mixed models, which allow the cultivar effect to be random. The random effect of cultivar enables genetic correlation to be exploited across zones while accounting for the heterogeneity of the trials. A zone can be viewed as a larger target population of environments. It is crucial to evaluate the accuracy and precision of the cultivar predictions. The prediction accuracy can be evaluated via a cross-validation (CV) study, and the model selection can be done based on the lowest mean squared error of prediction (MSEP). Also, since the trials' locations rarely coincide with growers' fields, the precision of predictions needs to be evaluated via standard errors of predictions of cultivar values (SEPV) and standard errors of the predictions of pairwise differences of cultivar values (SEPD). The central objective of this thesis is to assess the model performance and conduct model selection via a CV study for zone-based cultivar predictions. Chapter 2 compared the performance of empirical best linear unbiased estimation (EBLUE) and empirical best linear unbiased prediction (EBLUP) for zone-based prediction. Different CV schemes were used for the single-year and multi-year datasets to mimic practice. A complex covariance structure such as factor-analytic (FA) was imposed to account for the heterogeneity of the cultivar×zone (CZ) effect. The MSEP showed that the EBLUP models outperformed the EBLUE models.
Zonation was necessary since it improved the accuracy and is preferable for making cultivar recommendations. The FA structure did not improve the accuracy compared to the simpler covariance structure, and so the EBLUP model with a simple covariance structure is sufficient for the single-year and multi-year datasets. Chapter 3 assessed the single-stage and stagewise analyses. Three weighting methods were compared in the stagewise analysis, namely two diagonal approximation methods and the fully efficient method, together with the unweighted analysis. The assessment was based on the MSEP instead of Pearson's and Spearman's correlation coefficients since the correlation coefficients are often very close between the compared models. The MSEP showed that the single-stage EBLUP and the stagewise weighting EBLUP strategy were very similar. Thus, the loss of information due to diagonal approximation is minor. In fact, compared to the correlation coefficients, the MSEP showed a more apparent distinction between the single-stage and stagewise weighting analyses on the one hand and the unweighted EBLUE on the other. The simple compound-symmetric covariance structure was sufficient for the CZ effect compared with the more complex structures. The choice between the single-stage and stagewise weighting analysis, thus, depends on the computational resources and the practicality of data handling. Chapter 4 assessed the accuracy and precision of the predictions for new locations. The environmental covariates were combined with the EBLUP in the random coefficient (RC) models since the covariates provide more information for the new locations. The RC models did not have the smallest MSEP, but they had the lowest SEPV and SEPD. Thus, the model selection can be done by joint consideration of the MSEP, SEPV, and SEPD. The models with EBLUE and covariate interaction effects performed poorly regarding the MSEP.
The EBLUP models without RC performed best in terms of MSEP, but their SEPV and SEPD were large, so their predictions were considered unreliable. Covariate scaling and selection are essential to obtain a positive definite covariance matrix. Employing an unstructured covariance in the RC models is crucial to maintaining their invariance feature. The RC framework is suitable to be implemented with GIS data to provide an accurate and precise projection of cultivar performance for new locations or environments. To conclude, the EBLUP model for zone-based predictions should be preferred to obtain predictions and rankings closer to the true values and rankings. The stagewise weighting analysis can be recommended due to its practicality and its computational efficiency. Furthermore, projecting cultivar performances to new locations should be done to provide more targeted information for growers. The available environmental covariates can be utilised in the RC model framework to improve the accuracy and precision of the predictions in new locations. Such information is certainly more valuable for growers and breeders than just providing means across a whole target population of environments.
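The abstract above prefers the MSEP to correlation coefficients for comparing models. One general reason the two criteria can disagree, offered here as an illustration and not as the thesis's own argument, is that a correlation coefficient ignores a constant bias in the predictions whereas the MSEP penalises it. The sketch below uses made-up numbers and hypothetical function names.

```python
import math


def msep(predicted, observed):
    """Mean squared error of prediction: average squared prediction error."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical validation values and predictions from two models.
observed = [5.0, 6.0, 7.0, 8.0]
model_a = [5.1, 5.9, 7.1, 7.9]   # small, unbiased errors
model_b = [6.1, 6.9, 8.1, 8.9]   # model_a shifted upwards by 1.0

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name, round(pearson(predicted, observed), 4), round(msep(predicted, observed), 4))
```

Both models have the same correlation with the observed values, but the MSEP of model B is about one unit larger because of its constant bias.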
Weighting methods for variance heterogeneity in phenotypic and genomic data analysis for crop breeding
(2019) Damesa, Tigist Mideksa; Piepho, Hans-Peter
In plant breeding programmes, multi-environment trials (MET) form the backbone for phenotypic selection, genomic selection (GS) and genome-wide association studies (GWAS). Efficient analysis of MET is fundamental to get accurate results from phenotypic selection, GS and GWAS. On the other hand, inefficient analysis of MET data may have consequences such as biased ranking of genotype means in phenotypic data analysis, low accuracy of GS and wrong identification of quantitative trait loci (QTL) in GWAS. A combined analysis of MET is performed using either single-stage or stage-wise (two-stage) approaches based on the linear mixed model framework. While single-stage analysis is a fully efficient approach, MET data can be suitably analyzed using stage-wise methods. MET data often show within-trial and between-trial variance heterogeneities, which contradict the homogeneity of variance assumption of linear models, and these heterogeneities require corrections. In addition, it is well documented that spatial correlations are inherent to most field trials. Appropriate remedial techniques for variance heterogeneities and proper accounting of spatial correlation are useful to improve accuracy and efficiency of MET analysis. Chapter 2 studies methods for simultaneous handling of within-trial variance heterogeneity and within-trial spatial correlation. This study is conducted based on three maize trials from Ethiopia. To stabilize the variance, the Box-Cox transformation was considered. The result shows that, while the Box-Cox transformation was suitable for stabilizing the variance, it is difficult to report results on the original scale. As an alternative, variance models, i.e. power-of-the-mean (POM) and exponential models, were used to address the variance heterogeneity problem. Unlike the Box-Cox method, the variance models considered in this study dealt successfully with both spatial correlation and heterogeneity of variance at the same time.
For analysis of MET data, two-stage analysis is often favored in practice over single-stage analysis because of its suitability in terms of computation time, and its ability to easily account for any specifics of each trial (variance heterogeneity, spatial correlation, etc). Stage-wise analyses are approximate in that they cannot fully reproduce a single-stage analysis because the variance–covariance matrix of adjusted means from the first-stage analysis is sometimes ignored or sometimes approximated and the approximation may not be efficient. Discrepancy of results between single-stage and two-stage analysis increases when the variance between trials is heterogeneous. In stage-wise analysis one of the major challenges is how to account for heterogeneous variance between trials at the second stage. To account for heterogeneous variance between trials, a weighted mixed model approach is used for the second-stage analysis. The weights are derived from the variances and covariances of adjusted means from the first-stage analysis. In Chapter 3 we compared single-stage analysis and two-stage analysis. A new fully efficient and a diagonal weighting matrix are used for weighting in the second stage. The methods are explored using two different types of maize datasets. The result indicates that single-stage analysis and two-stage analysis give nearly identical results provided that the full information on all effect estimates and their associated estimated variances and covariances is carried forward from the first to the second stage. GWAS and GS analysis can be conducted using a single-stage or a stage-wise approach. The computational demand for GWAS and GS increases compared to purely phenotypic analysis because of the addition of marker data. Usually researchers compute genotype means from phenotypic MET data in stage-wise analysis (with or without weighting) and then forward these means to GWAS or GS analysis, often without any weighting. 
In Chapter 4, weighted and unweighted stage-wise analyses are compared for GWAS and GS using phenotypic and genotypic maize data. Fully efficient and diagonal weighting are used. Results show that weighting is preferred over unweighted analysis for both GS and GWAS. In conclusion, stage-wise analysis is a suitable approach for practical analysis of MET, GS and GWAS. Single-stage and two-stage analysis of MET yield very similar results. Stage-wise analysis can be nearly as efficient as single-stage analysis when using optimal weighting, i.e., fully efficient weighting. Spatial variation and within-trial variance heterogeneity are common in MET data. This study illustrated that both can be resolved simultaneously using a weighting approach for the variance heterogeneity and spatial modeling for the spatial variation. Finally, besides applying weighting in the analysis of phenotypic MET data, it is recommended to use weighting in the actual GS and GWAS analysis stage.
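The idea of carrying first-stage precision forward into the second stage can be sketched in its simplest diagonal form: each adjusted mean is weighted by the inverse of its squared standard error. This is a deliberately reduced illustration with hypothetical numbers and function names, not the weighting schemes evaluated in the thesis. It ignores trial effects, genotype-by-environment interaction and the covariances among adjusted means that fully efficient weighting would use.

```python
def weighted_genotype_mean(adjusted_means, standard_errors):
    """Combine one genotype's adjusted means across trials.

    Each trial is weighted by 1 / SE^2 (a diagonal, inverse-variance weight).
    Returns the weighted mean and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    mean = sum(w * m for w, m in zip(weights, adjusted_means)) / total_weight
    return mean, (1.0 / total_weight) ** 0.5


def unweighted_genotype_mean(adjusted_means):
    """Plain average across trials, ignoring first-stage precision."""
    return sum(adjusted_means) / len(adjusted_means)


# Hypothetical first-stage output for one genotype in two trials:
# the second trial is much less precise than the first.
means = [8.0, 10.0]
ses = [1.0, 2.0]

print(unweighted_genotype_mean(means))
print(weighted_genotype_mean(means, ses))
```

In this toy case the unweighted mean is 9.0, while the weighted mean is pulled towards the more precise trial (8.4), which is the behaviour weighting is meant to provide when variances differ between trials.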