(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------

DAJIN enables multiplex genotyping to simultaneously validate intended and unintended target genome editing outcomes

['Akihiro Kuno', 'Department Of Anatomy', 'Embryology', 'Faculty Of Medicine', 'University Of Tsukuba', 'Tsukuba', 'Ph.D Program In Human Biology', 'School Of Integrative', 'Global Majors', 'Yoshihisa Ikeda']

Date: 2022-02

Genome editing can introduce designed mutations into a target genomic site. Recent research has revealed that it can also induce various unintended events such as structural variations, small indels, and substitutions at, and in some cases, away from the target site. These rearrangements may result in confounding phenotypes in biomedical research samples and cause a concern in clinical or agricultural applications. However, current genotyping methods do not allow a comprehensive analysis of diverse mutations for phasing and mosaic variant detection. Here, we developed a genotyping method with an on-target site analysis software named Determine Allele mutations and Judge Intended genotype by Nanopore sequencer (DAJIN) that can automatically identify and classify both intended and unintended diverse mutations, including point mutations, deletions, inversions, and cis double knock-in at single-nucleotide resolution. Our approach with DAJIN can handle approximately 100 samples under different editing conditions in a single run. With its high versatility, scalability, and convenience, DAJIN-assisted multiplex genotyping may become a new standard for validating genome editing outcomes.

Funding: Grant number 19H03142 to S.M. and A.K. from the Ministry of Education, Culture, Sports, Science, and Technology. Grant number 20ae0201011h0003 to S.M. and S.T. from the Japan Agency for Medical Research and Development. Grant number JPMJPF2017 to S.T. and A.Y. from the Japan Science and Technology Agency. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: All underlying data can be found in the Supporting Information files deposited at the OSF repository (https://osf.io/w7ade/). DAJIN is accessible at https://github.com/akikuno/DAJIN under the MIT Licence. The version of DAJIN used in this study to reproduce the analyses can be found at https://github.com/akikuno/DAJIN/tree/manuscript-version . All sequencing data are available in the DDBJ DRA under accession number DRA011971 (https://ddbj.nig.ac.jp/resource/sra-submission/DRA011971).

Herein, we describe a novel method for analysing genome editing outcomes, in which long-chain PCR products with barcodes obtained using 2-step long-range PCR were used as samples, and allele validation was performed using our original software named Determine Allele mutations and Judge Intended genotype by Nanopore sequencer (DAJIN) that enables the comprehensive analysis of long reads generated using the nanopore long-read sequencing technology. DAJIN, a machine learning–based model, identifies and quantifies allele numbers and their mutation patterns and reports consensus sequences to visualise mutations in alleles at single-nucleotide resolutions. Moreover, it allows multiple sample processing, and approximately 100 samples can be processed within a day. Because of these features, our strategy with DAJIN can validate the quality of genome-edited samples to select animals or clones with intended results efficiently and as such has the potential to contribute to more precise genome editing.

The assessment of on-target editing outcomes and the selection of correct, precisely edited alleles lead to efficient production and breeding of founder animals and their offspring as well as efficient in vivo and ex vivo engineering.
Demultiplexing of highly homologous mutated alleles is required to separate the signals of each allele from genetically engineered samples. However, the subcloning of amplified products is laborious, and short-range assessments with targeted PCR amplification and tracking of indels by decomposition analysis of Sanger sequencing data are likely to miss long-range mutation events, which may result in pathogenic phenotypes through unintended changes in gene expression [17, 18]. Moreover, short-range PCR analysis followed by Illumina-based short-read next-generation sequencing (short-read NGS) cannot identify multiple intended or unwanted mutations in cis or in trans [19, 20]. Long-read sequencing technologies enable a comprehensive analysis of the region of interest by providing longer sequence reads than the traditional strategy and make it possible to identify unexpected genome editing outcomes, including complex structural variations (SVs) [10, 13]. Although targeted long-read sequencing allows the detection of complex on-target mutations over several kilobases [13, 21], this method has instrumental limitations such as error rates and a lack of tools for phasing and mosaic variant detection to validate multiple and diverse allelic variants at the single-base level [22]. Thus, more accessible and high-throughput methods for routine assessment of genome editing outcomes are essential to detect unpredictable editing events.

Cell populations with incorrectly edited alleles need to be detected and excluded to ensure precise genome editing [8]. Unintended alleles with similar genetic impact may be tolerated only for specific purposes of genome editing, for instance, the generation of null alleles through the deletion of critical exon(s) by using multiple guide RNAs (gRNAs), which results in multiple patterns in the total deleted length [9].
Recent studies have found that genome editing can induce various on-target events such as inversions, deletions, and endogenous and exogenous DNA insertions as well as indels and substitutions at, and in some cases, away from the target site [10–13]. Furthermore, there is a possibility of gene conversion between homologous regions following genomic DNA cleavage [14–16].

The development of new technologies such as CRISPR-Cas has facilitated genome editing of any species or cell type. Nucleases such as Cas9 and FokI and deaminase fused with Cas9 have been used to introduce DNA double-strand breaks and perform base editing, respectively [1–3]. However, as double-strand break repair pathways are regulated by host cells [4], verifying the result and selecting desired mutated alleles for precise genome editing are essential. Multiple alleles exist in a population of cells or individual animals that have undergone genome editing. In most cases, animals born following editing events at early embryonic stages are mosaic [5]. Heterogeneous cell populations can be obtained by genome editing of cultured cells or delivering genome editing tools to somatic cells [6, 7].

DAJIN generates BAM files to visualise the DAJIN-reported alleles in a genome browser. First, DAJIN uses minimap2 to map the nanopore sequence reads to the WT sequences described in the user-inputted FASTA file, then samtools (version 1.10) [32] generates sorted BAM files. Next, the target genome coordinates and chromosome length are obtained from the UCSC Table Browser [33] according to the user-inputted FASTA file and genome assembly ID. Then, DAJIN replaces the chromosome number and chromosome length in SN and LN headers of BAM files.

To improve the interpretability, DAJIN has a default setup to remove minor alleles. Minor alleles were defined as those in which the number of reads was 1% or less of the total number of reads of a sample.
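The 1% rule can be made concrete with a small sketch. The function name, the dictionary input, and the `min_percent` parameter below are illustrative assumptions and are not taken from DAJIN's source; the only behaviour drawn from the text is that alleles supported by 1% or less of a sample's reads are removed.

```python
def drop_minor_alleles(read_counts, min_percent=1):
    """Keep only alleles supported by more than `min_percent` % of reads.

    read_counts: dict mapping a (hypothetical) allele label to its read count.
    Integer arithmetic is used so that an allele at exactly 1% is removed,
    matching the "1% or less" wording.
    """
    total = sum(read_counts.values())
    return {
        allele: n
        for allele, n in read_counts.items()
        if n * 100 > min_percent * total
    }


if __name__ == "__main__":
    counts = {"WT": 600, "intended": 390, "rare": 10}
    # "rare" holds exactly 1% of the 1,000 reads, so it is dropped.
    print(drop_minor_alleles(counts))
```

Because the comparison is strict, an allele sitting exactly on the 1% boundary counts as minor; whether DAJIN rounds or handles ties differently is not stated in the text.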
DAJIN can report all allele information when the “filter = off” option is used.

To mitigate nanopore sequencing errors, the relative MIDS frequencies of the sample were subtracted from those of the control reads. We call the subtracted MIDS relative frequencies the “MIDS score” (S7 Fig). The MIDS score was reduced to 5 dimensions using principal component analysis (PCA). Then, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) [31] was performed for the allele clustering. For parameters, we set “min_samples,” which controls how conservatively points are assigned to clusters rather than labelled as noise, to “1” to maximise valid reads. Furthermore, we tuned “min_cluster_size,” which defines the minimum number of samples in each cluster. We tested 50 equally spaced values between 10% and 40% of the total number of reads and then selected the “min_cluster_size” that yielded the modal number of clusters.

To distinguish each allele precisely, DAJIN conducts compressed MIDS conversion and clustering. To generate fixed-length sequences, we performed compressed MIDS conversion, which replaces successive insertions with a character corresponding to the number of insertions and then substitutes the insertion (S6 Fig). A character is assigned to the number from 1 to 9 or a letter from a to z. If the number of consecutive insertions is in the range 1 to 9, the character is the corresponding number. If the number of consecutive insertions is in the range 10 to 35, the character is “a” (= 10) to “y” (= 35). If the number is greater than 35, the character is “z” (>35).

We constructed a DNN model to classify alleles. The structure of the deep learning model is shown in S5 Fig. The architecture comprises 3 sets of convolutional and max-pooling layers, a fully connected (FC) layer, and a softmax function to predict the allele types. The batch size was 32.
The maximum number of training epochs was 200, and the training was stopped when the validation loss had not improved for 20 epochs. To detect reads with large rearrangements, we extracted the outputs from the FC layer. Then, we trained the local outlier factor (LOF) [30] using the output of the simulated reads. Subsequently, the FC-layer outputs of the nanopore sequence reads were passed to the LOF, which annotated reads with unexpected mutations as “large rearrangements (LARs),” a term we use in this manuscript for genomic rearrangements longer than approximately 50 bp. We assessed the accuracy of the classification using simulation reads, and the model accurately classified alleles in all genome editing designs conducted in this study (S8 Table).

The extracted reads were subjected to MIDS conversion (S4B Fig). The matched, inserted, deleted, and substituted bases compared to the control sequence were converted to M (Match), I (Insertion), D (Deletion), and S (Substitution), respectively. Next, the reads were trimmed or padded with “=” to equalise their sequence length. Then, one-hot encoding was performed on the MIDS sequence.

We performed preprocessing to exclude reads without target loci and to perform Match, Insertion, Deletion, and Substitution (MIDS) conversion. First, the genome-edited sequence was aligned to the user-provided WT sequence using minimap2 (version 2.17) [29] with the “--cs=long” option, and the position of the target mutant base was detected according to the CS-tag in the SAM file. Simulated and nanopore sequencing reads were then aligned using minimap2 to the WT sequence. Reads more than 1.1 times as long as the longest possible allele were excluded. For the remaining reads, we detected the start and end positions of each read relative to the WT sequence based on CIGAR information and extracted the reads containing the mutant region of interest (S4A Fig).
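As a rough illustration of the preprocessing and MIDS conversion described above, the following Python sketch filters reads by length and target coverage using CIGAR strings, then turns a long-form cs string into a padded MIDS sequence. It is not DAJIN's own code: the function names, the 1-based coordinate convention, and the choice to trim and pad at the right-hand end are assumptions made for the example, and SAM parsing is omitted.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Long-form cs tokens: "=" match run, "*" substitution, "+" insertion, "-" deletion.
CS_RE = re.compile(r"(=[A-Za-z]+|\*[a-z]{2}|\+[a-z]+|-[a-z]+)")


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) a read covers on the WT sequence."""
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + ref_len - 1


def keep_read(read_len, pos, cigar, max_allele_len, target_start, target_end):
    """Apply the length filter, then require the read to span the target region."""
    if read_len > 1.1 * max_allele_len:
        return False  # longer than 1.1x the longest expected allele
    start, end = reference_span(pos, cigar)
    return start <= target_start and end >= target_end


def cs_to_mids(cs_long):
    """Convert a long-form cs string into one M/I/D/S character per base."""
    out = []
    for token in CS_RE.findall(cs_long):
        kind, body = token[0], token[1:]
        if kind == "=":
            out.append("M" * len(body))
        elif kind == "*":
            out.append("S")  # a substitution token describes a single base
        elif kind == "+":
            out.append("I" * len(body))
        else:  # "-" deletion
            out.append("D" * len(body))
    return "".join(out)


def pad_or_trim(mids, length, pad="="):
    """Equalise length: trim on the right or pad on the right with '='."""
    return mids[:length].ljust(length, pad)


if __name__ == "__main__":
    print(keep_read(115, 101, "5S50M2I48M3D10M", 200, 120, 180))
    print(pad_or_trim(cs_to_mids("=ACGT*ag+tt=AC-g=T"), 15))
```

Here `cs_to_mids("=ACGT*ag+tt=AC-g=T")` gives `MMMMSIIMMDM`: four matches, one substitution, two inserted bases, two matches, one deleted base, and a final match. One-hot encoding would then map each of the five symbols (M, I, D, S, and the pad character) to its own indicator vector.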
To prepare training data for deep neural network (DNN) models, we generated simulation reads of the possible alleles using NanoSim (version 2.5.0) [28]. We trained NanoSim to obtain an error profile using nanopore sequencing reads from a wild-type (WT) control. Next, we applied the error profile to generate 10,000 simulation reads for each possible allele that could be caused by genome editing (S1 Fig). In the PM design, we generated simulation reads with a deletion or random nucleotide insertion of the gRNA length at the Cas-cutting site.

Genomic PCR was performed using AmpliTaq Gold 360 DNA Polymerase (Thermo Fisher Scientific) and the relevant primers whose barcode sequences were added to the 5′ end (S7 Table). The PCR amplicons of 226 bp (Tyr c.140 G>C) and 203 bp (Tyr c.316 G>C, Tyr c.308 G>C) were purified using FastGene Gel/PCR Extraction Kit (Nippon Genetics, Düren, Germany). Paired-end sequencing (2 × 151 bases) of these purified amplicons was performed using MiSeq (Illumina, San Diego, CA, USA) at Tsukuba i-Laboratory LLP (Tsukuba, Ibaraki, Japan). Paired-end reads were mapped against chromosome 7 of mouse genome assembly mm10 using STAR (version 2.7.0a) with default settings [26]. Mapped reads were visualised using IGV (version 2.9.4) [27]. Samples carrying the intended PM at frequencies >10% were considered positive.

To evaluate the validity of DAJIN’s genotyping results, we used conventional genotyping methods, including short-amplicon PCR, PCR-RFLP, and Sanger sequencing. For the genotyping of the 2-cut KO and PM lines, genomic PCR was performed using AmpliTaq Gold 360 DNA Polymerase (Thermo Fisher Scientific) and the relevant primers (S6 Table). Agarose gel electrophoresis was performed to confirm the size of the PCR products. In the flox knock-in (KI) design, genomic PCR was performed using KOD FX (Toyobo) and the relevant primers (S6 Table).
The PCR products were digested with restriction enzymes AscI (New England Biolabs) and EcoRV (New England Biolabs) for 2 h to check LoxP insertion on the left and right side, respectively. Agarose gel electrophoresis was performed to confirm the size of the PCR fragments. PCR products with mutant sequences were identified by Sanger sequencing using the BigDye Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific).

(a) The schematic of DAJIN’s workflow. DAJIN automates the procedures highlighted in grey. Red-coloured nucleotides represent the intended PM. A green-highlighted nucleotide represents an unintended substitution. Illustrations were modified from the Togo picture gallery, licenced under CC-BY 4.0 (Togo picture gallery by the DBCLS, Japan). (b) The outputs of DAJIN. The file formats are described in parentheses. See S1 Data for raw data from https://osf.io/w7ade/ . DAJIN, Determine Allele mutations and Judge Intended genotype by Nanopore sequencer; DBCLS, Database Center for Life Science; LAR, large rearrangement; PM, point mutation; WT, wild type.

We used PI-200 (Kurabo Industries, Osaka, Japan), according to the manufacturer’s protocol, for the extraction and purification of genomic DNA obtained from the tail of mice. The purified genomic DNA was amplified by PCR using KOD multi & Epi (Toyobo, Osaka, Japan) and target amplicon primers (S3 Table). In the target amplicon primer, the universal sequence (22 mer) is located on the 5′ side, and the sequence for target gene amplification is on the 3′ side. Five-fold dilutions of the PCR products were used as templates for nested PCR performed using KOD multi & Epi and barcode attachment primers (S4 Table). The 5′ side of the barcode attachment primer has a barcode sequence (24 mer), and the 3′ sequence anneals to the universal sequence of the target amplicon primer (Fig 1A).
The barcoded PCR products were mixed in equal amounts and then purified using FastGene Gel/PCR Extraction Kit (Nippon Genetics, Germany). The concentration of the mixed and purified PCR products was adjusted to 20 to 30 ng/μL. The library was prepared using Ligation Sequencing 1D kit (SQK-LSK108_109; ONT, Oxford, UK), NEBNext End repair/dA-tailing Module, and NEB Blunt/TA Ligase Master Mix (New England Biolabs, Ipswich, MA, USA) according to the manufacturer’s instructions. The prepared library was loaded onto a primed R9.4 Spot-On Flow cell (FLO-MIN106; ONT, Oxford, UK). The 24-h or 36-h run time calling sequencing protocol was selected in the MinKNOW GUI (version 4.0.20), and base calling was allowed to complete after the sequencing run was completed. After base calling, we demultiplexed the barcoding libraries using qcat (version 1.1.0) with default parameter settings. Total nanopore sequencing reads per sample are listed in S5 Table.

Mice with point mutations (PMs) and 2-cut knockout (KO) were generated using the electroporation method [23]. The gRNA target sequences to induce each mutation are listed in S1 Table. The gRNAs were synthesised and purified using a GeneArt Precision gRNA Synthesis Kit (Thermo Fisher Scientific, Waltham, MA, USA) and dissolved in Opti-MEM (Thermo Fisher Scientific). In addition, we designed 3 single-strand oligodeoxynucleotide (ssODN) donors for inducing PMs in Tyr (S1 Table). These ssODN donors were ordered as Ultramer DNA oligos from Integrated DNA Technologies (Coralville, IA, USA) and dissolved in Opti-MEM. The mixtures of gRNA (5 ng/μL) and ssODNs (100 ng/μL) or mixtures of 2 gRNAs (25 ng/μL each) were used to generate point mutant mice or 2-cut KO mice, respectively. GeneArt Platinum Cas9 Nuclease (100 ng/μL; Thermo Fisher Scientific) was added to these mixtures.
Pregnant mare serum gonadotropin (5 units) and human chorionic gonadotropin (5 units) were intraperitoneally injected into female C57BL/6J mice (Charles River Laboratories) with a 48-h interval. Next, unfertilised oocytes were collected from their oviducts. Then, according to standard protocols, we performed in vitro fertilisation with these oocytes and sperm from male C57BL/6J mice (Charles River Laboratories). After 5 h, the abovementioned gRNA/ssODN/Cas9 or 2 gRNAs/Cas9 mixtures were electroporated into the mouse zygotes using a NEPA 21 electroporator (NEPAGENE; Chiba, Japan), under previously reported conditions [24]. The electroporated embryos that developed into the 2-cell stage were transferred to oviducts of pseudopregnant ICR female mice.

The floxed mice were generated using the microinjection method [25]. Each gRNA target sequence (S1 Table) was inserted into the entry site of pX330-mC carrying both the gRNA and Cas9 expression units. These pX330-mC plasmid DNAs and the donor DNA plasmid were isolated using FastGene Plasmid Mini kit (Nippon Genetics, Tokyo, Japan) and filtered using a MILLEX-GV 0.22 μm filter unit (Merck Millipore, Darmstadt, Germany) for microinjection. Next, C57BL/6J female mice superovulated using the method described above were naturally mated with male C57BL/6J mice, and zygotes were collected from the oviducts of the mated female mice. For each gene, a mixture of 2 pX330-mC (circular, 5 ng/μL each) and a donor (circular, 10 ng/μL) was microinjected into the zygote. The zygotes that survived were then transferred into the oviducts of pseudopregnant ICR female mice. When the newborns were around 3 weeks of age (S2 Table), the tail was sampled to obtain genomic DNA.

ICR and C57BL/6J mice were purchased from Charles River Laboratories Japan (Yokohama, Japan). C57BL/6J-Tyr em2Utr mice were provided by RIKEN BRC (#RBRC06459).
Mice were kept in plastic cages under specific pathogen-free conditions in a room maintained at 23.5 ± 2.5°C and 52.5 ± 12.5% relative humidity under a 14-h light:10-h dark cycle. Mice had free access to commercial chow (MF diet; Oriental Yeast, Tokyo, Japan) and filtered water. All animal experiments were performed humanely with the approval of the Institutional Animal Experiment Committee of the University of Tsukuba, following the Regulations for Animal Experiments of the University of Tsukuba and the Fundamental Guidelines for Proper Conduct of Animal Experiments and Related Activities in Academic Research Institutions under the jurisdiction of the Ministry of Education, Culture, Sports, Science, and Technology of Japan. The IACUC approval number for this animal experiment was UT_19–003. Euthanasia was performed by a skilled person, by cervical dislocation in adult mice and by decapitation with sufficiently keen dissection scissors in newborn mice.

Results

Workflow of DAJIN

We designed DAJIN to genotype genome-edited samples by capturing diverse mutations from SNVs to LARs, the latter covering genomic rearrangements longer than approximately 50 bp. The overall workflow of DAJIN is presented in Fig 1A. DAJIN requires (1) a FASTA file describing possible alleles, which must include the DNA sequence before and after genome editing; (2) FASTQ files from nanopore sequencing, which include a control sample; (3) the gRNA sequence including the protospacer adjacent motif (PAM); and (4) a genome assembly ID such as hg38 or mm10. Next, DAJIN generates simulation reads using NanoSim [28] according to the user-inputted FASTA file. The sequence reads are preprocessed and one-hot encoded. Subsequently, the simulated reads are used to train a DNN model to detect LAR reads and classify allele types. DAJIN defines LAR alleles as sequences that differ from those in the user-inputted FASTA file. Next, DAJIN conducts clustering to estimate the alleles.
Finally, it reports the consensus sequence to visualise the mutations in each allele and labels the alleles. The details are described in Methods and S1 and S4–S7 Figs. The outputs of DAJIN are shown in Fig 1B. DAJIN reports allele frequencies in each sample, the consensus sequences, and BAM files for each allele. In this study, DAJIN was evaluated on 9 mouse strains of 3 types of genome editing design: PM, 2-cut KO, and flox KI. The performance evaluations are described in detail below.

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001507 (C) Plos One. Licensed under Creative Commons Attribution (CC BY 4.0) URL: https://creativecommons.org/licenses/by/4.0/