Browsing by Person "Piepho, Hans-Peter"


Assessing the efficiency and heritability of blocked tree breeding trials
(2024) Piepho, Hans-Peter; Williams, Emlyn; Prus, Maryna
Progeny trials in tree breeding are often laid out using blocked experimental designs, in which families are randomly assigned to plots and several trees are planted per plot. Such designs are optimized for the assessment of family effects. However, tree breeders are primarily interested in assessing breeding values of individual trees. This paper considers the assessment of heritability at both the family and tree levels. We assess heritability based on pairwise comparisons among individual trees. The approach shows that there is considerable heterogeneity in pairwise heritabilities, primarily due to differences in both genetic and error variances between within-family and between-family comparisons. Our results further show that efficient blocking positively affects all types of comparison except those among trees within the same plot.
Biometrical approaches for analysing gene bank evaluation data on barley (Hordeum spec.)
(2007) Hartung, Karin; Piepho, Hans-Peter
This thesis explored methods to statistically analyse phenotypic data of gene banks. Traits of the barley data (Hordeum spp.) of the gene bank of the IPK-Gatersleben were evaluated. The data of years 1948-2002 were available. Within this period the ordinal scale changed from a 0-5 to a 1-9 scale after 1993. At most gene banks reproduction of accessions is currently done without any experimental design. With data of a single year only rarely do accessions have replications and there are only a few replications of a single check for winter and summer barley. The data of 2002 were analysed separately for winter and summer barley using geostatistical methods. For the traits analysed four types of variogram model (linear, spherical, exponential and Gaussian) were fitted to the empirical variogram using non-linear regression. The spatial parameters obtained by non-linear regression for every variogram model then were implemented in a mixed model analysis and the four model fits compared using Akaike's Information Criterion (AIC). The approach to estimate the genetic parameter by kriging cannot be recommended. The first points of the empirical variogram should be explained well by the fitted theoretical variogram, as these represent most of the pairwise distances between plots and are most crucial for neighbour adjustments. The most common well-fitting geostatistical models were the spherical and the exponential model. A nugget effect was needed for nearly all traits. The small number of check plots for the available data made it difficult to accurately separate the genetic effect from environmental effects. The threshold model allows for joint analysis of multi-year data from different rating scales, assuming a common latent scale for the different rating systems. The analysis suggests that a mixed model analysis which treats ordinal scores as metric data will yield meaningful results, but that the gain in efficiency is higher when using a threshold model.
The threshold model may also be used when there is a metric scale underlying the observed ratings. The Laplace approximation as a numerical method to integrate the log-likelihood for random effects worked well, but it is recommended to increase the number of quadrature points until the change in parameter estimates becomes negligible. Three rating methods (1%, 5%, 9-point rating) were assessed by persons untrained (A) and experienced (B) in rating. Every person had to rate several pictograms of diseased leaves. The highest accuracy was found with Group B using the 1%-scale and with Group A using the 5%-scale. With a percentage scale Group A tended to use values that are multiples of 5%. For the time needed per leaf assessment, Group B was fastest when using the 5% rating scale. From a statistical point of view both percent ratings performed better than the ordinal rating scale and the possible error made by the rater is calculable and usually smaller than with ratings by coarser methods. So directly rating percentages whenever possible leads to smaller overall estimation errors, and with proper training accuracy and precision can be further improved. For gene banks, augmented designs as proposed by Federer and by Lin et al. are an obvious choice, so an overview is given. The augmented designs proposed by Federer have the advantage of an unbiased error estimate. But the random allocation of checks is a problem. The augmented design by Lin et al. always places checks in the centre plot of every whole plot. But none of the methods is based on an explicit statistical model, so there is no well-founded decision criterion to select between them. Spatial analysis can be used to find an optimal field layout for an augmented design, i.e. a layout that yields small least significant differences.
The average variance of a difference and the average squared LSD were used to compare competing designs, using a theoretical approach based on variations of two anisotropic models and different rotations of anisotropy axes towards field reference axes. Based on theoretical calculations, up to five checks per block are recommended. The nearly isotropic combinations led to designs with large square blocks. With strongly anisotropic combinations the optimal design depends on the degree of anisotropy and the rotation of anisotropy axes: without rotation small elongated blocks are preferred; the closer the rotation is to 45° the more squarish blocks and the more checks are appropriate. The results presented in this thesis may be summarised as follows: Cultivation for regeneration of accessions should be based on a meaningful and statistically analysable experimental field design. The design needs to include checks and a random sample of accessions from the gene pool held at the gene bank. It is advisable to utilise metric or percentage rating scales. It can be expected that using a threshold model increases the quality of multivariate analysis and association mapping studies based on phenotypic gene bank data.
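The spherical and exponential variogram models named in this abstract can be illustrated with a short sketch. The Python functions below evaluate the standard textbook forms of both models with a nugget. The function names, the parameterisation (nugget, partial sill, range parameter) and any values passed in are assumptions made for illustration; they are not taken from the thesis, and no model fitting is attempted here.

```python
import math


def spherical(h, nugget, partial_sill, range_):
    """Spherical semivariogram (illustrative parameterisation).

    Rises from the nugget to nugget + partial_sill at distance range_
    and stays flat beyond it. By convention the value at h = 0 is 0.
    """
    if h == 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, partial_sill, range_):
    """Exponential semivariogram (illustrative parameterisation).

    Approaches nugget + partial_sill asymptotically; range_ is the
    range parameter, not the practical range.
    """
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_))
```

Evaluating both functions over the plot distances of a trial and comparing them with the empirical variogram, especially at short distances, mirrors the comparison described above.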
Biometrical tools for heterosis research
(2010) Schützenmeister, André; Piepho, Hans-Peter
Molecular biological technologies are frequently applied for heterosis research. Large datasets are generated, which are usually analyzed with linear models or linear mixed models. Both types of model make a number of assumptions, and it is important to ensure that the underlying theory applies for datasets at hand. Simultaneous violation of the normality and homoscedasticity assumptions in the linear model setup can produce highly misleading results of associated t- and F-tests. Linear mixed models assume multivariate normality of random effects and errors. These distributional assumptions enable (restricted) maximum likelihood based procedures for estimating variance components. Violations of these assumptions lead to results that are unreliable and thus potentially misleading. A simulation-based approach for the residual analysis of linear models is introduced, which is extended to linear mixed models. Based on simulation results, the concept of simultaneous tolerance bounds is developed, which facilitates assessing various diagnostic plots. This is exemplified by applying the approach to the residual analysis of different datasets, comparing results to those of other authors. It is shown that the approach is also beneficial when applied to formal significance tests, which may be used for assessing model assumptions as well. This is supported by the results of a simulation study, where various alternative, non-normal distributions were used for generating data of various experimental designs of varying complexity. For linear mixed models, where studentized residuals are not pivotal quantities, as is the case for linear models, a simulation study is employed for assessing whether the nominal error rate under the null hypothesis complies with the expected nominal error rate. Furthermore, a novel step within the preprocessing pipeline of two-color cDNA microarray data is introduced.
The additional step comprises spatial smoothing of microarray background intensities. It is investigated whether anisotropic correlation models need to be employed or isotropic models are sufficient. A self-versus-self dataset with superimposed sets of simulated, differentially expressed genes is used to demonstrate several beneficial features of background smoothing. In combination with background correction algorithms, which avoid negative intensities and which have already been shown to be superior, this additional step increases the power in finding differentially expressed genes, lowers the number of false positive results, and increases the accuracy of estimated fold changes.
Bird species richness and diversity responses to land use change in the Lake Victoria Basin, Kenya
(2024) Mugatha, Simon M.; Ogutu, Joseph O.; Piepho, Hans-Peter; Maitima, Joseph M.
The increasing demand for cultivated lands driven by human population growth, escalating consumption and activities, combined with the vast area of uncultivated land, highlights the pressing need to better understand the biodiversity conservation implications of land use change in Sub-Saharan Africa. Land use change alters natural wildlife habitats with fundamental consequences for biodiversity. Consequently, species richness and diversity typically decline as land use changes from natural to disturbed. We assess how richness and diversity of avian species, grouped into feeding guilds, responded to land use changes, primarily expansion of settlements and cultivation at three sites in the Lake Victoria Basin in western Kenya, following tsetse control interventions. Each site consisted of a matched pair of spatially adjacent natural/semi-natural and settled/cultivated landscapes. Significant changes occurred in bird species richness and diversity in the disturbed relative to the natural landscape. Disturbed areas had fewer guilds and all guilds in disturbed areas also occurred in natural areas. Guilds had significantly more species in natural than in disturbed areas. The insectivore/granivore and insectivore/wax feeder guilds occurred only in natural areas. Whilst species diversity was far lower, a few species of estrildid finches were more common in the disturbed landscapes and were often observed on the scrubby edges of modified habitats. In contrast, the natural and less disturbed wooded areas had relatively fewer estrildid species and were completely devoid of several other species. In aggregate, land use changes significantly reduced bird species richness and diversity on the disturbed landscapes regardless of their breeding range size or foraging style (migratory or non-migratory) and posed greater risks to non-migratory species.
Accordingly, land use planning should integrate conservation principles that preserve salient habitat qualities required by different bird species, such as adequate patch size and habitat connectivity, conserve viable bird populations and restore degraded habitats to alleviate adverse impacts of land use change on avian species richness and diversity.
Breeding progress of nitrogen use efficiency of cereal crops, winter oilseed rape and peas in long-term variety trials
(2024) Laidig, Friedrich; Feike, T.; Lichthardt, C.; Schierholt, A.; Piepho, Hans-Peter
Breeding and registration of improved varieties with high yield, processing quality, disease resistance and nitrogen use efficiency (NUE) are of utmost importance for sustainable crop production to minimize adverse environmental impact and contribute to food security. Based on long-term variety trials of cereals, winter oilseed rape and grain peas tested across a wide range of environmental conditions in Germany, we quantified long-term breeding progress for NUE and related traits. We estimated the genotypic, environmental and genotype-by-environment interaction variation and correlation between traits and derived heritability coefficients. Nitrogen fertilizer application was considerably reduced between 1995 and 2021 in the range of 5.4% for winter wheat and 28.9% for spring wheat while for spring barley it was increased by 20.9%. Despite the apparent nitrogen reduction for most crops, grain yield (GYLD) and nitrogen accumulation in grain (NYLD) increased or did not significantly decrease. NUE for GYLD increased significantly for all crops, by between 12.8% and 35.2%, and for NYLD by between 8% and 20.7%. We further showed that the genotypic rank of varieties for GYLD and NYLD was about equivalent to the genotypic rank of the corresponding traits of NUE, if all varieties in a trial were treated with the same nitrogen rate. Heritability of nitrogen yield was about the same as that of grain yield, suggesting that nitrogen yield should be considered as an additional criterion for variety testing to increase NUE and reduce negative environmental impact.
Correction to: Assessing the efficiency and heritability of blocked tree breeding trials
(2024) Piepho, Hans-Peter; Williams, Emlyn; Prus, Maryna
Correction to: Breeding progress of nitrogen use efficiency of cereal crops, winter oilseed rape and peas in long-term variety trials
(2024) Laidig, Friedrich; Feike, T.; Lichthardt, C.; Schierholt, A.; Piepho, Hans-Peter
Dependence of the abundance of reed glass-winged cicadas (Pentastiridius leporinus (Linnaeus, 1761)) on weather and climate in the Upper Rhine Valley, Southwest Germany
(2025) Kakarla, Sai Kiran; Schall, Eric; Dettweiler, Anna; Stohl, Jana; Glaser, Elisabeth; Adam, Hannah; Teubler, Franziska; Ingwersen, Joachim; Sauer, Tilmann; Piepho, Hans-Peter; Lang, Christian; Streck, Thilo; Guo, Jianying
The planthopper Pentastiridius leporinus, commonly called reed glass-winged cicada, transmits the pathogens “Candidatus Arsenophonus phytopathogenicus” and “Candidatus Phytoplasma solani”, which are infesting sugar beet and, most recently, also potato in the Upper Rhine valley area of Germany. They cause the “Syndrome Basses Richesses” associated with reduced yield and sugar content in sugar beet, leading to substantial monetary losses to farmers in the region. No effective solutions exist currently. This study uses statistical models to understand to what extent the abundance of cicadas depends on climate regions during the vegetation period (April–October). We further investigated what influence temperature and precipitation have on the abundance of the cicadas in sugar beet fields. Furthermore, we investigated the possible impacts of future climate on cicada abundance. Also, 22 °C and 8 mm/day were found to be the optimal temperature and precipitation conditions for peak male cicada flight activity, while 28 °C and 8 mm/day were the optimum for females. By the end of the 21st century, daily male cicada abundance is projected to increase significantly under the worst-case high greenhouse gas emission scenario RCP8.5 (RCP: Representative Concentration Pathway), with confidence intervals suggesting a possible 5–15-fold increase compared to current levels. In contrast, under the low-emission scenario RCP2.6, male cicada populations are projected to be 60–70% lower than under RCP8.5. An understanding of the influence of changing temperature and precipitation conditions is crucial for predicting the spread of this pest to different regions of Germany and other European countries.
Design evaluation and predictive accuracy of multi-environment trials in plant breeding
(2025) Gudata, Diriba Tadese; Piepho, Hans-Peter
In plant breeding, predictive accuracy of genotype means in the target population of environments (TPE) can be improved through proper experimental design and statistical analysis. During experimentation, blocking and randomization are expected to handle the major sources of heterogeneity in the field. When heterogeneity exists in both directions, across rows and columns, two-way blocking is necessary to ensure homogeneity within blocks. Several trials need to be conducted in the TPE to generalize information. The TPE can be divided into zones, which allows for borrowing information between zones when fitting genotypes as random and for zone-specific recommendations. The analysis of multi-environment trial (MET) data can follow either a one-stage or a stage-wise approach; in the latter case, information from individual trials is forwarded to the next stage of analysis. The linear mixed model (LMM) is commonly used in MET data analysis. Furthermore, auxiliary information from the locations, particularly soil information and weather data, can be integrated into MET data analysis to improve predictive accuracy. In general, the objective of this thesis was to improve the predictive accuracy of modeling MET data based on different approaches of integrating environmental covariates (ECs) and pedigree information. Spatial model selection and design evaluation were conducted in the second chapter using existing MET data from the dry lowland sorghum breeding program of Ethiopia. A randomization-based model and randomization-based models augmented with linear variance and exponential spatial variation were compared in partially replicated and fully replicated row-column designs using the Akaike information criterion (AIC). The baseline model with a two-dimensional nonlinear spatial model plus nugget improved the fitted model in many trials. In addition, the randomization-based plus two-dimensional linear variance model was also a good candidate model.
According to the AIC, it is difficult to find a specific model that suits all the trials. Therefore, trying different spatial models and selecting the best-fitting model per trial could be a solution. The current design practice was also evaluated in the same chapter by generating alternative designs through restructuring the blocking units and computing the relative efficiency. The relative efficiency results indicate that most of the alternative alpha designs with block sizes of five, six, ten and fifteen, and the alternative row-column designs, were more efficient than the current practice. In the third chapter, a method of extracting and fitting synthetic environmental covariates (SCs) and pedigree information in multi-location trial data analysis was investigated. The main goal of this chapter was to compare the predictive accuracy of LMMs without pedigree information and SCs and with pedigree and/or SCs for predicting genotype performance in untested locations. The SCs were extracted from the actual ECs by using multivariate partial least squares (PLS) analysis. Subsequently, they were fitted in the LMM assuming random coefficients for genotypes. An unstructured variance-covariance matrix of the random intercept and slope(s) was considered to ensure translational invariance. For the model with pedigree information, the baseline model with independent genotype effects was modified to allow correlation between genotypes through their parents. For the genotype-environment interaction (GEI) effect, the identity, the diagonal and the factor-analytic (FA) variance-covariance structures were considered. The mean squared error of prediction differences (MSEPD) and the Spearman rank correlation show that integrating the SCs in MET analysis improves predictive accuracy compared to the model without SCs. For all variance-covariance structures of the GEI, integrating SCs was beneficial.
There is also an improvement from modelling pedigree information using diagonal and FA variance-covariance structures for genotype-environment effects. The diagonal variance-covariance structure of the GEI with the SCs is the most accurate model for predicting genotype means in new locations. In Chapter 4, the predictive accuracy under different approaches of fitting ECs in predicting genotypic performance in new environments was evaluated. The kinship matrix based on ECs, reduced rank regression and extended Finlay-Wilkinson approaches were evaluated and compared in predicting genotype means. Among these, the reduced rank regression approach showed the smallest MSEPD. The limitation of this approach is that there are singularity problems when the number of ECs exceeds the number of environments. For this reason, variable selection using multivariate PLS was conducted to consider only the most important covariates in the subsequent modelling. Overall, there is a substantial gain in predictive accuracy from considering ECs compared to the model without ECs. In addition, we evaluated the importance of fitting the geographic zone factor; however, the result shows little improvement compared to the model without the zone factor. This result may be related to the small number of trials in some of the zones. One limitation of the data set when considering the zone effect is that only a few trials remained in the western and northern zones after removing trials with zero genotype variances during individual trial analysis. The southern zone comprises the majority of the trials. The optimum allocation of trials to the zones was also explored based on the variance-covariances of the genotype-by-zone interactions. In Chapters 3 and 4, when predicting genotype performance in new environments, a cross-validation (CV) scheme that drops one environment at a time was used.
This type of CV mimics the prediction for new environments and assesses uncertainty in model prediction. In conclusion, this study developed methods for improving the accuracy of genotypic performance prediction models in METs by improving the design efficiency in ongoing breeding programs through a post-blocking mechanism, by fitting spatial models to capture spatial field trends in an experiment, and by using ECs, SCs and pedigree information.
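The cross-validation scheme mentioned in this abstract, in which one environment at a time is held out, can be sketched as follows. This is a minimal illustration only: the function name, the record layout (dictionaries with an "env" key) and the default key name are hypothetical and are not taken from the thesis, and the model fitting step is deliberately left out.

```python
from collections import defaultdict


def leave_one_environment_out(records, env_key="env"):
    """Yield (held_out_env, train, test) folds, one per environment.

    records is a list of dicts; env_key names the field that
    identifies the environment (a hypothetical layout for this sketch).
    """
    by_env = defaultdict(list)
    for rec in records:
        by_env[rec[env_key]].append(rec)

    for env in sorted(by_env):
        # All records of the held-out environment form the test set.
        test = by_env[env]
        # Every other environment contributes to the training set.
        train = [
            rec
            for other, rows in by_env.items()
            if other != env
            for rec in rows
        ]
        yield env, train, test
```

In each fold a model would be fitted on the training records and used to predict genotype means in the held-out environment, which is treated as if it were new.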
Early prediction of biomass in hybrid rye based on hyperspectral data surpasses genomic predictability in less-related breeding material
(2021) Galán, Rodrigo José; Bernal-Vasquez, Angela-Maria; Jebsen, Christian; Piepho, Hans-Peter; Thorwarth, Patrick; Steffan, Philipp; Gordillo, Andres; Miedaner, Thomas
Key message: Hyperspectral data is a promising complement to genomic data to predict biomass under scenarios of low genetic relatedness. Sufficient environmental connectivity between data used for model training and validation is required. Abstract: The demand for sustainable sources of biomass is increasing worldwide. The early prediction of biomass via indirect selection of dry matter yield (DMY) based on hyperspectral and/or genomic prediction is crucial to affordably tap the potential of winter rye (Secale cereale L.) as a dual-purpose crop. However, this estimation involves multiple genetic backgrounds and genetic relatedness is a crucial factor in genomic selection (GS). To assess the prospect of prediction using reflectance data as a suitable complement to GS for biomass breeding, the influences of trait heritability and genetic relatedness were compared. Models were based on genomic (GBLUP) and hyperspectral reflectance-derived (HBLUP) relationship matrices to predict DMY and other biomass-related traits such as dry matter content (DMC) and fresh matter yield (FMY). For this, 270 elite rye lines from nine interconnected bi-parental families were genotyped using a 10k SNP array and phenotyped as testcrosses at four locations in two years (eight environments). From 400 discrete narrow bands (410 nm–993 nm) collected by an uncrewed aerial vehicle (UAV) on two dates in each environment, 32 hyperspectral bands previously selected by Lasso were incorporated into a prediction model. HBLUP showed higher prediction abilities (0.41 – 0.61) than GBLUP (0.14 – 0.28) under a decreased genetic relationship, especially for mid-heritable traits (FMY and DMY), suggesting that HBLUP is much less affected by relatedness and trait heritability. However, the predictive power of both models was largely affected by environmental variances. Prediction abilities for DMY were further enhanced (up to 20%) by integrating both matrices and plant height into a bivariate model.
Thus, data derived from high-throughput phenotyping emerges as a suitable strategy to efficiently leverage selection gains in biomass rye breeding; however, sufficient environmental connectivity is needed.
Estimating heritability in plant breeding programs
(2019) Schmidt, Paul; Piepho, Hans-Peter
Heritability is an important notion in, e.g., human genetics, animal breeding and plant breeding, since the focus of these fields lies on the relationship between phenotypes and genotypes. A phenotype is the composite of an organism’s observable traits, which is determined by its underlying genotype, by environmental factors and by genotype-environment interactions. For a set of genotypes, the notion of heritability expresses the proportion of the phenotypic variance that is attributable to the genotypic variance. Furthermore, as it is an intraclass correlation, heritability can also be interpreted as, e.g., the squared correlation between phenotypic and genotypic values. It is important to note that heritability was originally proposed in the context of animal breeding where it is the individual animal that represents the basic unit of observation. This stands in contrast to plant breeding, where multiple observations for the same genotype are obtained in replicated trials. Furthermore, trials are usually conducted as multi-environment trials (MET), where an environment denotes a year × location combination and represents a random sample from a target population of environments. Hence, the observations for each genotype first need to be aggregated in order to obtain a single phenotypic value, which is usually done by obtaining some sort of mean value across trials and replicates. As a consequence, heritability in the context of plant breeding is referred to as heritability on an entry-mean basis and its standard estimation method is a linear combination of variances and trial dimensions. Ultimately, I find that there are two main uses for heritability in plant breeding: The first is to predict the response to selection and the second is as a descriptive measure for the usefulness and precision of cultivar trials. 
Heritability on an entry-mean basis is suited for both purposes as long as three main assumptions hold: (i) the trial design is completely balanced/orthogonal, (ii) genotypic effects are independent and (iii) variances and covariances are constant. In the last decades, however, many advancements in the methodology of experimental design for and statistical analysis of plant breeding trials took place. As a consequence it is seldom the case that all three of the above-mentioned assumptions are met. Instead, the application of linear mixed models enables the breeder to straightforwardly analyze unbalanced data with complex variance structures. Chapter 2 demonstrates by example some of the flexibility and benefit of the mixed model framework for typically unbalanced MET by using a bivariate mixed model analysis to jointly analyze two MET for cultivar evaluation, which differ in multiple crucial aspects such as plot size, trial design and general purpose. Such an approach can lead to higher accuracy and precision of the analysis and thus more efficient and successful breeding programs. It is not clear, however, how to define and estimate a generalized heritability on an entry-mean basis for such settings. Therefore, multiple alternative methods for the estimation of heritability on an entry-mean basis have been proposed. In Chapter 3, six alternative methods are applied to four typically unbalanced MET for cultivar evaluation and compared to the standard method. The outcome suggests that the standard method over-estimates heritability, while all of the alternative methods show similar, lower estimates and thus seem able to handle this kind of unbalanced data. Finally, it is argued in Chapter 4 that heritability in plant breeding is not actually based on or aiming at entry-means, but on the differences between them.
Moreover, an estimation method for this new proposal of heritability on an entry-difference basis (H_Delta^2/h_Delta^2) is derived and discussed, as well as exemplified and compared to other methods by analyzing four different datasets for cultivar evaluation which differ in their complexity. I argue that, regarding the use of heritability as a descriptive measure, H_Delta^2/h_Delta^2 can on the one hand give a more detailed and meaningful insight than all other heritability methods and on the other hand reduces to other methods under certain circumstances. When it comes to the use of heritability as a means to predict the response to selection, the outcome of this work discourages this as a whole. Instead, response to selection should be simulated directly and thus without using any ad hoc heritability measure.
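The standard entry-mean heritability described in this abstract, a combination of variance components and trial dimensions, can be illustrated with a simplified sketch. The function below assumes a balanced MET in which environments are treated as a single factor (rather than separate years and locations), with genotypic, genotype-by-environment and error variance components. The function and argument names are hypothetical, and this textbook simplification is not necessarily the exact form used in the thesis.

```python
def entry_mean_heritability(var_g, var_ge, var_err, n_env, n_rep):
    """Entry-mean heritability for a balanced MET (simplified sketch).

    H2 = var_g / (var_g + var_ge / n_env + var_err / (n_env * n_rep))

    var_g   : genotypic variance
    var_ge  : genotype-by-environment interaction variance
    var_err : residual (plot error) variance
    n_env   : number of environments
    n_rep   : number of replicates per environment
    """
    if n_env <= 0 or n_rep <= 0:
        raise ValueError("n_env and n_rep must be positive")
    # Variance of a genotype mean across environments and replicates.
    phenotypic = var_g + var_ge / n_env + var_err / (n_env * n_rep)
    return var_g / phenotypic
```

The sketch relies on exactly the balance and constant-variance assumptions that the abstract lists as seldom met in practice, which is why alternative estimators are compared there.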
Evaluation of alternative statistical methods for genomic selection for quantitative traits in hybrid maize
(2012) Schulz-Streeck, Torben; Piepho, Hans-Peter
The efficacy of several contending approaches for genomic selection (GS) was tested using different simulation and empirical maize breeding datasets. Here, GS is viewed as a general approach, incorporating all the different stages from the phenotypic analysis of the raw data to the marker-based prediction of the breeding values. The overall goal of this study was to develop and comparatively evaluate different approaches for accurately predicting genomic breeding values in GS. In particular, the specific objectives were to: (1) Develop different approaches for using information from analyses preceding the marker-based prediction of breeding values for GS. (2) Extend and/or suggest efficient implementations of statistical methods used at the marker-based prediction stage of GS, with a special focus on improving the predictive accuracy of GS in maize breeding. (3) Compare different approaches to reliably evaluate and compare methods for GS. An important step in the analyses preceding the marker-based prediction is the phenotypic analysis stage. One way of combining phenotypic analysis and marker-based prediction into a single stage analysis is presented. However, a stagewise analysis is typically computationally more efficient than a single stage analysis. Several different weighting schemes for minimizing information loss in stagewise analyses are therefore proposed and explored. It is demonstrated that orthogonalizing the adjusted means before submitting them to the next stage is the most efficient way within the set of weighting schemes considered. Furthermore, when using stagewise approaches, it may suffice to omit the marker information until the very last stage, if the marker-by-environment interaction has only a minor influence, as was found to be the case for the datasets considered in this thesis. It is also important to ensure that genotypic and phenotypic data for GS are of sufficiently high quality.
This can be achieved by using appropriate field trial designs and carrying out adequate quality controls to detect and eliminate observations deemed to be outlying based on various diagnostic tools. Moreover, it is shown that pre-selection of markers is less likely to be of high practical relevance to GS in most cases. Furthermore, the use of semivariograms to select models with the greatest strength of support in the data for GS is proposed and explored. It is shown that several different theoretical semivariogram models were all well supported by an example dataset and no single model was selected as being clearly the best. Several methods and extensions of GS methods have been proposed for marker-based prediction in GS. Their predictive accuracies were similar to that of the widely used ridge regression best linear unbiased prediction method (RR-BLUP). It is thus concluded that RR-BLUP, spatial methods, machine learning methods, such as componentwise boosting, and regularized regression methods, such as elastic net and ridge regression, have comparable performance and can therefore all be routinely used for GS for quantitative traits in maize breeding. Accounting for environment-specific or population-specific marker effects had only minor influence on predictive accuracy, contrary to findings of several other studies. However, accuracy varied markedly among populations, with some populations showing surprisingly low levels of accuracy. Combining different populations prior to marker-based prediction improved prediction accuracy compared to doing separate population-specific analyses. Moreover, polygenic effects can be added to the RR-BLUP model to capture genetic variance not captured by the markers. However, doing so yielded minor improvements, especially for high marker densities.
To relax the assumption of homogeneous variance of markers, the RR-BLUP method was extended to accommodate heterogeneous marker variances, but this had negligible influence on the predictive accuracy of GS for a simulated dataset. The widely used information-theoretic model selection criterion, namely the Akaike information criterion (AIC), ranked models in terms of their predictive accuracies similarly to cross-validation in the majority of cases. But further tests would be required to definitively determine whether the computationally more demanding cross-validation may be substituted with more efficient model selection criteria, such as AIC, without much loss of accuracy. Overall, a stagewise analysis, in which the markers are omitted until the very last stage, is recommended for GS for the tested datasets. The particular method used for marker-based prediction from the set of those currently in use is of minor importance. Hence, the widely used and thoroughly tested RR-BLUP method would seem adequate for GS for most practical purposes, because it is easy to implement using widely available software packages for mixed models and it is computationally efficient.
Extensions and applications of generalized linear mixed models for network meta-analysis of randomized controlled trials
(2022) Wiksten, Anna; Piepho, Hans-Peter
Network meta-analysis of published clinical trials has received increased attention over the past years, with some meta-analytic publications having had a big impact on the cost-benefit assessment of important drugs. Much of the research has been based on Bayesian analysis using the so-called baseline contrast model. The research in network meta-analysis methodology has in parts been isolated from other fields of mathematical statistics and is lacking an integrative framework clearly separating statistical models and assumptions, inferential principles, and computational algorithms. The very extensive past research on ANOVA and MANOVA of unbalanced designs, variance component models, and generalised linear models with fixed and/or random effects provides a wealth of useful approaches and insights. These models are especially common in agricultural statistics, and this thesis extended the use of the general statistical methods mainly applied in agricultural statistics to applications of network meta-analysis of clinical trials. The methods were applied to four different research problems in separate manuscripts. The first manuscript was based on a simulated case (based on a real example) where some of the trials provided individual patient data and some only aggregated data. The outcome type considered was continuous normally distributed data. This manuscript provides models to jointly model the individual patient data and aggregated data. It was also explored how much information is lost if data are aggregated and how to quantify the amount of lost information. The second manuscript was based on a real-life dataset with pain medications used in acute postoperative pain. The outcome of interest was binomial, namely whether a subject experienced pain relief or not. The dataset used for NMA included 261 trials with 52 different treatment and dose combinations, making it an extraordinarily rich and large network. 
The third manuscript developed methods for a case of a time-to-event outcome extracted from published Kaplan-Meier curves of survival analyses. This re-generated individual patient data was then used to model and compare the Kaplan-Meier curves and hazards of different treatments. The fourth manuscript of the thesis tackled the problem of between-trial variance estimation for the specific method of Hartung-Knapp in classical two-treatment meta-analysis. The main finding of the paper was that in some cases random effect meta-analysis using the Hartung-Knapp method may yield shorter confidence intervals for the combined treatment effect than fixed effect meta-analysis, and therefore the recommendation is to always compare results from the Hartung-Knapp method with fixed effect meta-analysis. This thesis explored and developed the use of generalized linear mixed models in a setting of network meta-analysis of randomized clinical trials. In practice the most popular analysis method in the field of network meta-analysis has been the baseline contrast model, which is usually fitted in a Bayesian framework. The baseline contrast model and Bayesian estimation provide great flexibility, but also come with some unnecessary complications for certain types of analyses. This thesis showed how methods originally developed and extensively used in agricultural research can be used in other fields, providing efficient calculation, estimation, and inference. Some of the examples used in this thesis arose from analyses needed for real applications in drug development and were directly used in medical research.
Genetic variation for tolerance to the downy mildew pathogen Peronospora variabilis in genetic resources of quinoa (Chenopodium quinoa)
(2021) Colque-Little, Carla; Abondano, Miguel Correa; Lund, Ole Søgaard; Amby, Daniel Buchvaldt; Piepho, Hans-Peter; Andreasen, Christian; Schmöckel, Sandra; Schmid, Karl
Background: Quinoa (Chenopodium quinoa Willd.) is an ancient grain crop that is tolerant to abiotic stress and has favorable nutritional properties. Downy mildew is the main disease of quinoa and is caused by infections of the biotrophic oomycete Peronospora variabilis Gaüm. Since the disease causes major yield losses, identifying sources of downy mildew tolerance in genetic resources and understanding its genetic basis are important goals in quinoa breeding. Results: We infected 132 South American genotypes, three Danish cultivars and the weedy relative C. album with a single isolate of P. variabilis under greenhouse conditions and observed a large variation in disease traits like severity of infection, which ranged from 5 to 83%. Linear mixed models revealed a significant effect of genotypes on disease traits with high heritabilities (0.72 to 0.81). Factors like altitude at site of origin or seed saponin content did not correlate with mildew tolerance, but stomatal width was weakly correlated with severity of infection. Despite the strong genotypic effects on mildew tolerance, genome-wide association mapping with 88 genotypes failed to identify significant marker-trait associations, indicating a polygenic architecture of mildew tolerance. Conclusions: The strong genetic effects on mildew tolerance allow the identification of genetic resources that are valuable sources of resistance for future quinoa breeding.
Genomic prediction in rye
(2017) Bernal-Vasquez, Angela-Maria; Piepho, Hans-Peter
Technical progress in the genomic field is accelerating developments in plant and animal breeding programs. The access to high-dimensional molecular data has facilitated acquisition of knowledge of genome sequences in many economically important species, which can be used routinely to predict genetic merit. Genomic prediction (GP) has emerged as an approach that allows predicting the genomic estimated breeding value (GEBV) of an unphenotyped individual based on its marker profile. The approach can considerably increase the genetic gain per unit time, as not all individuals need to be phenotyped. The accuracy of the predictions is influenced by several factors and requires proper statistical models able to overcome the problem of having more predictor variables than observations. Plant breeding programs run for several years and genotypes are evaluated in multi-environment trials. Selection decisions are based on the mean performance of genotypes across locations and, later on, across years. Under these conditions, linear mixed models offer a suitable and flexible framework to undertake the phenotypic and genomic prediction analyses using a stage-wise approach, allowing refinement of each particular stage. In this work, outlier detection methods, phenotypic analyses and GP models were evaluated and compared. In particular, it was studied whether identification and removal of possible outlying observations at the plot level has an impact on the predictive ability, whether an enhancement of phenotypic models by spatial trends leads to improvement of GP accuracy, and finally, whether the use of the kinship matrix can enhance the dissection of GEBVs from genotype-by-year (GY) interaction effects. Here, the methods related to the mentioned objectives are compared using experimental datasets from a rye hybrid breeding program. 
Outlier detection methods widely used in many German plant breeding companies were assessed in terms of control of the family-wise error rate and their merits evaluated in a GP framework (Chapter 2). The benefit of implementing the methods based on a robust scale estimate was that in routine analysis, such procedures reliably identified spurious data. This outlier detection approach per trial at the plot level is conservative and ensures that adjusted genotype means are not severely biased due to outlying observations. Whenever possible, breeders should manually flag suspicious observations based on subject-matter knowledge. Further, removing the flagged outliers identified by the recommended methods did not reduce predictive abilities estimated by cross validation (GP-CV) using data of a complete breeding cycle. A crucial step towards an accurate calibration of the genomic prediction procedure is the identification of phenotypic models capable of producing accurate adjusted genotype mean estimates across locations and years. Using a two-year dataset connected through a single check, a three-stage GP approach was implemented (Chapter 3). In the first stage, spatial and non-spatial models were fitted per location and year to obtain adjusted genotype-tester means. In the second stage, adjusted genotype means were obtained per year, and in the third stage, GP models were evaluated. Akaike information criterion (AIC) and predictive abilities estimated from GP-CV were used as model selection criteria in the first and in the third stage. These criteria were used in the first stage because a choice had to be made between the spatial and non-spatial models, and in the third stage because the predictive abilities allow a comparison of the results of the complete analysis obtained by the alternative stage-wise approaches presented in this thesis. 
The second stage was a transitional stage where no model selection was needed for a given method of stage-wise analysis. The predictive abilities displayed a different ranking pattern for the models than the AIC, but both approaches pointed to the same best models. The highest predictive abilities obtained for the GP-CV at the last stage did not coincide with the models that AIC and predictive ability of GP-CV selected in the first stage. Nonetheless, GP-CV can be used to further support model selection decisions that are usually based only upon AIC. There was a trend for models accounting for row and column variation to have better accuracies than the counterpart models without row and column effects, thus suggesting that row-column designs may be a potential option to set up breeding trials. While bulking multi-year data allows increasing the training set size and covering a wider genetic background, it remains a challenge to separate GEBVs from GY effects when there are no common genotypes across years, i.e., years are poorly connected or totally disconnected. First, in an approach considering the two-year dataset connected through a single check, adjusted genotype means were computed per year and submitted to the GP stage (Chapter 3). The year adjustment was done in the GP model by assuming that the mean across genotypes in a given year is a good estimate of the year effect. This assumption is valid because the genotypes evaluated in a year are a sample of the population. Results indicated that this approach is more realistic than relying on the adjustment of a single check. A further approach entailed the use of kinship to dissect GY effects from GEBVs (Chapter 4). It was not obvious which method best models the GY effect, thus several approaches were compared and evaluated in terms of predictive abilities in forward validation (GP-FV) scenarios. 
It was found that for training sets formed by several disconnected years’ data, the use of kinship to model GY effects was crucial. In training sets where two or three complete cycles were available (i.e. there were some common genotypes across years within a cycle), using kinship or not yielded similar predictive abilities. It was further shown that predictive abilities are higher for scenarios with high relatedness degree between training and validation sets, and that predicting a selection of top-yielding genotypes was more accurate than predicting the complete validation set when kinship was used to model GY effects. In conclusion, stage-wise analysis is recommended and it is stressed that the careful choice of phenotypic and genomic prediction models should be made case by case based on subject matter knowledge and specificities of the data. The analyses presented in this thesis provide general guidelines for breeders to develop phenotypic models integrated with GP. The methods and models described are flexible and allow extensions that can be easily implemented in routine applications.
Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data
(2024) Lourenço, Vanda M.; Ogutu, Joseph O.; Rodrigues, Rui A.P.; Posekany, Alexandra; Piepho, Hans-Peter
Background: The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high dimensional data. Not surprisingly, machine learning methods are becoming widely advocated for and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. However, such studies are crucial for (i) identifying groups of methods with superior genomic predictive performance and (ii) assessing the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance and informally assess the computational cost of several groups of supervised machine learning methods, specifically, regularized regression methods, deep, ensemble and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program. Results: Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend upon both the data and target traits and that for classical regularized methods, increasing model complexity can incur huge computational costs but does not necessarily always improve predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. 
This rules out selection of one procedure among machine learning methods for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, simplicity and therefore relatively few tuning parameters, the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction. Conclusions: The dependence of predictive performance and computational burden on target datasets and traits calls for increasing investments in enhancing the computational efficiency of machine learning algorithms and computing resources.
Guest editors’ introduction to the special issue on “Recent advances in design and analysis of experiments and observational studies in agriculture”
(2020) Piepho, Hans-Peter; Tempelman, Robert J.; Williams, Emlyn R.
The Journal of Agricultural, Biological and Environmental Statistics (JABES) special issue on Recent Advances in Design and Analysis of Experiments and Observational Studies in Agriculture covers a select set of topics currently of primary importance in the field. Efficient use of resources in agricultural research, as well as valid statistical inference, requires good designs, and this special issue boasts seven papers providing both review and cutting-edge methodology for the purpose. A broad range of methods for analysis of data arising in different branches of agricultural research is covered in another five exciting papers. This special issue highlights the importance of and opportunities for applied statistics in agriculture.
Highlighting the potential of multilevel statistical models for analysis of individual agroforestry systems
(2023) Golicz, Karolina; Piepho, Hans-Peter; Minarsch, Eva-Maria L.; Niether, Wiebke; Große-Stoltenberg, André; Oldeland, Jens; Breuer, Lutz; Gattinger, Andreas; Jacobs, Suzanne
Agroforestry is a land-use system that combines arable and/or livestock management with tree cultivation, which has been shown to provide a wide range of socio-economic and ecological benefits. It is considered a promising strategy for enhancing resilience of agricultural systems that must remain productive despite increasing environmental and societal pressures. However, agroforestry systems pose a number of challenges for experimental research and scientific hypothesis testing because of their inherent spatiotemporal complexity. We reviewed current approaches to data analysis and sampling strategies of bio-physico-chemical indicators, including crop yield, in European temperate agroforestry systems to examine the existing statistical methods used in agroforestry experiments. We found multilevel models, which are commonly employed in ecology, to be underused and under-described in agroforestry system analysis. This Short Communication together with a companion R script are designed to act as an introduction to multilevel models and to promote their use in agroforestry research.
Impact of plastic rain shields and exclusion netting on pest dynamics and implications for pesticide use in apples
(2025) Bischoff, Robert; Piepho, Hans-Peter; Scheer, Christian; Petschenka, Georg
Apple production is among the most pesticide-intensive cultures. Recently, plastic rain shields and pest exclusion netting have emerged as potential measures to reduce the heavy reliance on chemical pesticides in apple, due to their inhibitory effect on pathogen and pest infestations. In a field trial, we compared yields, pest, and pathogen abundance in an orchard consisting of four plots, where two plots were covered with anti-hail net covers, one with plastic rain shields only, and one with plastic rain shields and exclusion netting. Pests and pathogens were assessed visually, and beating tray samples were collected to compare overall arthropod diversity between plots. We observed virtually no scab infections in both plastic rain shield plots, despite a more than 70% reduction of fungicides applied, when compared to anti-hail plots. Although no codling moth insecticides were sprayed in the plot with exclusion netting we found significantly reduced damage here, when compared to the anti-hail plots. However, likely due to microclimatic changes, we observed an increase of powdery mildew, woolly apple aphids, and spider mites under plastic rain shields. Modeling of metabolic rates of arthropod herbivores and predators revealed that there is an increased potential of herbivory under plastic rain shields. However, in terms of plant protection, the net effect of plastic rain shields and exclusion netting was a substantial reduction in chemical pesticide use, demonstrating that they represent a promising approach to minimize the use of chemical pesticides in apple production.
Improvement of breeding strategies for the trait vase life in cut carnations (Dianthus caryophyllus L.)
(2018) Boxriker, Maike; Piepho, Hans-Peter
Carnation (Dianthus caryophyllus L.) is one of the ten most famous cut flowers worldwide. A single big flower characterizes standard carnations, while mini carnations possess multiple flowers per stem. Vase life (VL) is one of the most important breeding objectives in carnations because of long transportation times and its direct influence on customers. But VL is a complex trait with several effects influencing it. Two-phase traits like VL are traits for which the plants are cultivated in the greenhouse in a first phase and the assessment is done in the laboratory in a second phase. Many experiments have a two-phase character, but little research has been conducted to develop experimental designs for the second phase. To improve breeding efficiency, molecular markers and genomic selection are used in agricultural science, but they are so far not common in ornamental breeding. The goal of this thesis was the implementation of SNP-based molecular markers for the trait VL to improve selection of long-lasting, transportable cut carnations. For marker association, 1,500 carnation genotypes were screened for VL behavior in an experimental design in both phases. Response to selection was used to assess efficiency. The second-phase experimental design was more important for precise data analyses. This highlights the need for research on this topic. Furthermore, it was possible to suggest row-column designs for VL trials. Row-column designs are more flexible in the case of positional effects compared with one-dimensional blocking and can be easily analyzed like an α-design. The easiest way to design the following phases is to apply the design one-to-one. The carnation types, mini and standard, showed an influence on VL. The mini carnations last 0.5 d longer than the standard carnations. The same conclusion was drawn based on the molecular data. Transcriptome data was generated with two different sequencing methods. 
Independent analysis of the two carnation types gave different results from analysis of the whole data set. This indicates that the analysis of carnations should be done separately for each carnation type. Association of the phenotypic and genotypic data was so far not possible. As an alternative to molecular markers, genetic correlations between VL and other breeding-relevant traits were calculated for use in indirect selection. For the first time, bivariate analysis was conducted in two-phase experiments. The genotypic correlation between VL and FD was high, but indirect selection would be less effective than direct selection. However, the information can provide an indication of the performance, and the effort to measure FD is small. The calculated high heritability of VL and the observed differences in VL of up to 15 d between the best and worst genotypes showed the potential of improving the population mean by using improved selection strategies like marker-assisted selection or auxiliary traits and by using statistical methods like experimental designs in all phases of the experiment. The influence of carnation type was shown in this thesis and indicates that the implementation of molecular markers must be done independently for each carnation type. The importance of experimental designs in multi-phase experiments was highlighted, and statistical analysis by mixed models and a bivariate analysis of different traits was performed. So far, no molecular marker for VL has been identified; a further research project will address this by generating more genotypic data and constructing a genetic map.