**Identify Species with good Protein sequences for Orthofinder and CAFE Analysis** *K. Castellano - October 19,2021* _genome file location:_ 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/ _excel sheet:_ Species_for_gain_loss_8Oct2021.xlsx A. Download appropriate protein files from 
available genomes for comparison (see excel sheet)
    1. Check quality with BUSCO _program and version:_ BUSCO v5.1.2 _lineage database:_ metazoa ODBv10 _input:_ protein files and log file (busco_v5.log) 
            GCF_000003605.2_Skow_1.1_protein.faa GCF_001949145.1_Aplanci_OKI-Apl_1.0_genomic.fna GCF_902459465.1_eAstRub1.3_protein.faa 
            GCF_000003815.2_Bfl_VNyyK_protein.faa GCF_011630105.1_Ajaponica_ASM1163010v1_protein.faa 
            GCA_002754855.1_Ajaponicus_seacucumber_ASM275485v1_protein.faa GCF_000224145.3_KH_protein.faa GCF_013122585.1_Sclava_ASM1312258v2_protein.faa
        /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/ _output:_ folder with BUSCO statistics for each protein file 
            GCF_000003605.2_Skow_1.1_protein GCF_001949145.1_Aplanci_OKI-Apl_1.0_genomic GCF_902459465.1_eAstRub1.3_protein GCF_000003815.2_Bfl_VNyyK_protein 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein GCA_002754855.1_Ajaponicus_seacucumber_ASM275485v1_protein GCF_000224145.3_KH_protein 
            GCF_013122585.1_Sclava_ASM1312258v2_protein
        /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/ _script:_ /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/busco_v5.sh *BUSCO 
         data also in excel sheet*
        _NOTES:_ All have really nice completeness scores but also really high duplication rates 2. Test using usearch to cluster to reduce duplication rate - 
    DO NOT USE THIS METHOD
        a. remove line breaks - test on A. japonica protein file ``` awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein.faa > GCF_011630105.1_Ajaponica_ASM1163010v1_protein_nolinebreak.faa ```
        b. cluster *test identity cutoff of 0.9 and 0.95* _program and version:_ usearch v11.0.667_i86linux64 _input:_ 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein_nolinebreak.faa /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica _output:_ 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed.faa (0.9 identity cutoff) 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed_0.95.faa /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica
        c. run BUSCO to assess filtering _input:_ clustered protein files GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed.faa (0.9 identity cutoff) 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed_0.95.faa /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica _output:_ 
            folder with busco summary GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed (0.9 identity cutoff) 
            GCF_011630105.1_Ajaponica_ASM1163010v1_protein_collapsed_0 (0.95 identity cutoff) 
            /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica _script:_ 
            /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica/busco_v5.sh _NOTES:_ Did reduce the duplicated % but also increased 
            fragmentation %. Also concerned about collapsing real duplicates so decided this was not the best method
    3. Test pulling out the longest isoform using the gene feature table - DO NOT USE THIS METHOD *test on A. japonica protein file*
        
        ```
        #1. get longest isoform from gene feature table
            cat GCF_011630105.1_ASM1163010v1_feature_table.txt \
            | awk 'BEGIN{FS="\t";OFS="\t"}($1=="CDS"){print $11,$13,$16,$19}' \ sort -k3,3 -k4,4nr \ awk 'BEGIN{FS="\t";OFS="\t";l=0;g=0}{if($3==g) {if 
            | ($4>=l) \
            {print $0; g=$3; l=$4}} else {print $0; g=$3; l=$4} }' > GCF_011630105.1_Ajaponica_longestIso.txt
        #2. get just the first column with transcript Id
            awk '{print $1}' GCF_011630105.1_Ajaponica_longestIso.txt > GCF_011630105.1_Ajaponica_longestIso_2.txt
        #3. get only the longest isoform sequences
            grep -A 1 -wFf GCF_011630105.1_Ajaponica_longestIso_2.txt GCF_011630105.1_Ajaponica_ASM1163010v1_protein_nolinebreak.faa > 
            GCF_011630105.1_Ajaponica_longestIso.faa grep ">" GCF_011630105.1_Ajaponica_longestIso.faa | wc -l
        #4. remove "-- characters that appear for some reason (checked and the sequences the come after are not effected, still full length when compared to 
        #the original file)
            sed -i 's/--//g' GCF_011630105.1_Ajaponica_longestIso.faa
                #the above command leaves empty lines - so this will remove the empty lines (does not work in a pipe - also sed '/--/d' removes other lines 
                #for some reasons)
            sed -i '/^$/d' GCF_011630105.1_Ajaponica_longestIso.faa
        #5. check to make sure the number of sequences is the same
            grep ">" GCF_011630105.1_Ajaponica_longestIso.faa | wc -l
                #25404
    4. Try gfacs to collapse uniques *pull out the ref seq gff file and genome file from NCBI* *the --unique-genes-only flag which will collapse directly 
        overlapping genes and resolve transcripts created* a. edit files so the names in the genome match the gff file
            ```
            #1. Edit header to match gff file
                #remove all characters after first space
                sed -i 's/\s.*$//' <file>_genomic.fasta
            #A.japonicus
            sed -i 's/\s.*$//' GCA_002754855.1_ASM275485v1_genomic.fna
            #A.rubens
            sed -i 's/\s.*$//' GCF_902459465.1_eAstRub1.3_genomic.fna
                #after gfacs repeat lines but one has no sequence below it - this has happened to me before listed when BUSCO tries to run because it cant run 
                #on repeat headers get lines and line number containing the specificed ID
                grep -n -A 2 "LOC117300036.1" Arubens_genes.fasta.faa 19019:>ID=gene-LOC117300036.1 19020:>ID=gene-LOC117300036.1 
                    19021-MGTWEPLFGRRGGSLMPLPIRKNSLCLQNKEVQTDMQQIDDTHVLFNIFNAATINHIVVFLTGTAAVPEGMGGAIYFSWPSPEGEMVWMFLGFITNEKPSAIFKVVGLKKGSSTNNSFMQSMPVQQPMKTMSQIGISVEPLHVIQHQIPAETTQTSTLTSFSEFSMKMLENFYNYASSF*
                #remove false header
                sed -i '19019d' Arubens_genes.fasta.faa
            #P. minimata
            sed -i 's/\s.*$//' GCF_015706575.1_ASM1570657v1_genomic.fasta
            #A. planci
            sed -i 's/\s.*$//' GCF_001949145.1_OKI-Apl_1.0_genomic.fasta
                #remove duplicate line
                grep -n -A 1 "LOC110983006.2" Aplanci_genes.fasta.faa 29723:>ID=gene-LOC110983006.2 29724:>ID=gene-LOC110983006.2 29725-MGSNQSNTTKVIG sed -i 
                '29723d' Aplanci_genes.fasta.faa
            #P. miniata
                #remove duplicate line
                grep -n -A 1 "LOC119730449.6" Pminiata_genes.fasta.faa 20547:>ID=gene-LOC119730449.6 20548:>ID=gene-LOC119730449.6 
                20549-MLLSSKKPHVCETSSKAFTEKSSPRIHDRV sed -i '20547d' Pminiata_genes.fasta.faa sed -i '29479d' Pminiata_genes.fasta.faa
                #two repeat headers
            #S. kowalevskii
            sed -i 's/\s.*$//' GCF_000003605.2_Skow_1.1_genomic.fasta grep -n -A 1 "LOC110983006.2" GCF_000003605.2_Skow_1.1_genomic.fasta
            #replace / in name
            sed -i 's/\//-/' Skowalevskii_genes.fasta.faa
            #C. intestinalis
            sed -i 's/\s.*$//' GCF_000224145.3_KH_genomic.fasta sed -i 's/\//-/' Cintestinalis_genes.fasta.faa sed -i 's/\s.*$//' 
            GCF_013122585.1_ASM1312258v2_genomic.fasta
            #B. florida
            sed -i 's/\s.*$//' GCF_000003815.2_Bfl_VNyyK_genomic.fasta
                #remove duplicate line
                grep -n -A 1 "gene-LOC118415711.2" Bflorida_genes.fasta.faa sed -i '15243d' Bflorida_genes.fasta.faa
            #S. purpuratus (5.0)
            sed -i 's/\s.*$//' sp5_0_GCF_genomic.fa
            #remove "/" in headers
            sed -i 's/\//-/' genes.fasta.faa
            #L variegatus
            sed -i 's/\s.*$//' GCF_018143015.1_Lvar_3.0_genomic.fna ```
        
        b. run gFACS with unique flag only _input:_ genome file, gff file for each species all species have their only folder in this location: 
            /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes _output:_ summary statistics, log file, filtered protein file, gene table in each of 
            the species folders /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes _script:_ gFACs_uniquesOnly.sh in each of the species folders 
            located here: /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes
        c. run busco on each filtered file (see excel sheet: Species_for_gain_loss_8Oct2021.xlsx) _input:_ filtered protein file in each of the species 
            folders located here: /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes _output:_ summary file in each of the species folders located 
            here: /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes _script:_ busco_v5.sh in each of the species folders located here: 
            /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes
        _NOTES:_ filtering with gFACs barely adjusted the completeness which is good but reduced the completeness. I think this is the best method as it only 
        picks uniques that overlap genomic space.
List of files for each species after filtering for uniques: /data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Aplanci/Aplanci_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Arubens/Arubens_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Bflorida/Bflorida_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Cintestinalis/Cintestinalis_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Lpictus/lytpic2.0.all.maker.proteins.LPI.fasta 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Lvariegatus3.0/Lvariegatus_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Pminiata/Pminiata_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Sclava/Sclava_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Skowalevskii/Skowalevskii_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Spurpuratus5.0/Spurpuratus5.0_genes.fasta.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponicus/GCA_002754855.1_Ajaponicus_seacucumber_ASM275485v1_protein.faa 
/data/usr/kcastellano/redSeaUrchin_OrthoFinder_otherGenomes/Ajaponica_uniques/Ajaponica_uniques_genes.fasta.faa