Available Commands

Below is a list of the available Chewbacca commands.

preclean

Error Correction

Note: This functionality is still untested and can over-correct legitimate variation.

Assembling Sequences

class assemble.Assemble_Command.Assemble_Command(args_)[source]

Assembles reads from two (forward and reverse) fastq files/directories. For a set of k forward read files, and k reverse read files, return k assembled files. Matching forward and reverse files should be identically named, except for a <forward>/<reverse> suffix that indicates the read orientation. The two suffix conventions below are supported. Choose ONE suffix style and stick to it! Mixed suffixes are not supported.

_forwards/_reverse
and
_R1/_R2
Inputs:
  • fastq file(s) with left reads
  • fastq file(s) with right reads
Outputs:
  • fastq File(s) with assembled reads
Notes:
  • Choose ONE suffix style and stick to it! Mixed suffixes are not supported. For example, Sample_100_forwards.fq and Sample_100_reverse.fq will be assembled into Sample_100_assembled.fq. Similarly, Sample_100_R1.fq and Sample_100_R2.fq will be assembled into Sample_100_assembled.fq. However, Sample_100_forwards.fq and Sample_100_R2.fq are not guaranteed to be matched.
  • You can provide as many pairs of files as you wish as long as they follow exactly one of the above naming conventions. If a ‘name’ parameter is provided, it will be used as a filename (not path) prefix for all assembled sequence files.
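
The pairing rule described in these notes can be illustrated with a minimal sketch. This is not Chewbacca's own code: the helper name pair_reads and the SUFFIX_STYLES table are hypothetical, and the sketch assumes the suffix sits directly before the file extension.

```python
# Hypothetical illustration of the naming rule; not part of Chewbacca.
SUFFIX_STYLES = [("_forwards", "_reverse"), ("_R1", "_R2")]

def pair_reads(filenames):
    """Return {sample: (forward_file, reverse_file)} using exactly one suffix style."""
    for fwd_suffix, rev_suffix in SUFFIX_STYLES:
        forwards, reverses = {}, {}
        for name in filenames:
            stem = name.rpartition(".")[0]  # file name without its extension
            if stem.endswith(fwd_suffix):
                forwards[stem[:-len(fwd_suffix)]] = name
            elif stem.endswith(rev_suffix):
                reverses[stem[:-len(rev_suffix)]] = name
        # Every file must belong to this style, and every sample needs both reads.
        all_matched = len(forwards) + len(reverses) == len(filenames)
        if forwards and all_matched and forwards.keys() == reverses.keys():
            return {sample: (forwards[sample], reverses[sample]) for sample in forwards}
    raise ValueError("files do not follow exactly one suffix style")

pairs = pair_reads(["Sample_100_R1.fq", "Sample_100_R2.fq"])
```

Under these assumptions, a mixed pair such as Sample_100_forwards.fq and Sample_100_R2.fq raises an error instead of being matched silently.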

Example

Assuming a forward reads file ‘Data_R1.fq’ and a reverse reads file ‘Data_R2.fq’,

./
    Data_R1.fq
    Data_R2.fq

$ python chewbacca.py assemble -n BALI -f Data_R1.fq  -r Data_R2.fq  -o rslt

rslt/
    BALI_DATA.assembled.fq
default_program

alias of Assemble_Program_Pear

Demultiplexing by Barcode

class demux.Demux_Barcode_Command.Demux_Barcode_Command(args_)[source]
Given a set of files, each file is assigned a file offset (value between sampleId and sequenceId). Each file is then split into separate child files, where each child file holds only sequences belonging to a single sample. These child files are named using the sample name of the sequences they list and the file offset of the file they came from. Demuxing is based on the nucleotide barcode prefixing each sequence.
Inputs:
  • One or more fasta/fastq files to demux.
  • A single .barcodes file, listing each sample name and its barcode sequence.
Outputs:
  • <sample_name>_<file_id#>_demux.<ext> file(s) - <fasta/fastq> files, containing all the sequences from file <file_id#>, which had a barcode corresponding to sample <sample_name>.
  • unmatched_<file_id#>_demux.<ext> file(s) - <fasta/fastq> files, containing sequences from file <file_id#>, whose barcode did not match any of those listed in the .barcodes file.
Notes:
  • The assignment of an offset to a file should be treated as an arbitrary process and should not be used for record keeping.
  • Each input file will generate its own unmatched_* file (if applicable).
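
As a rough illustration of barcode-based demultiplexing, the sketch below groups sequences by the barcode that prefixes them. It is a simplified stand-in, not the underlying program: the function name demux_by_barcode is hypothetical, matching is exact, and records are plain (name, sequence) tuples.

```python
# Hypothetical illustration; the real command delegates to an external program.
def demux_by_barcode(records, barcodes):
    """Group (name, sequence) records by the barcode that prefixes each sequence."""
    by_sample = {}
    for name, seq in records:
        # Pick the first sample whose barcode starts the sequence, else "unmatched".
        sample = next((s for s, bc in barcodes.items() if seq.startswith(bc)), "unmatched")
        by_sample.setdefault(sample, []).append((name, seq))
    return by_sample

barcodes = {"SampleA": "AGACGC", "SampleB": "AGTGTA"}
records = [("Seq6", "AGACGCAAAAAC"), ("Seq7", "AGTGTAAAAAAG"), ("Seq8", "CGTGTAAAAAAG")]
result = demux_by_barcode(records, barcodes)
```

Here Seq8 lands in the "unmatched" group because its first six bases match neither barcode.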

Example:

data/
    Data1.fasta:
        @Seq4
        AGACGCAAAAAA
        @Seq5
        AGTGTAAAAAAT


    Data2.fasta:
        @Seq6
        AGACGCAAAAAC
        @Seq7
        AGTGTAAAAAAG
        @Seq8
        CGTGTAAAAAAG
./
    Data.barcodes:
        SampleA        AGACGC
        SampleB        AGTGTA

$ python chewbacca.py demux_samples -i data/ -b Data.barcodes -o rslt

Here, we see that Data1.fasta was assigned ‘0’ as an offset, while Data2.fasta was assigned ‘1’ as an offset. Because both files had sequences from SampleA, the sequences from Data1.fasta were written to SampleA_0_demux.fastq, and those sequences from Data2.fasta were written to SampleA_1_demux.fastq. The same is true for SampleB.

rslt/
    SampleA_0_demux.fastq:
        @Seq4
        AGACGCAAAAAA

    SampleB_0_demux.fastq:
        @Seq5
        AGTGTAAAAAAT

    SampleA_1_demux.fastq:
        @Seq6
        AGACGCAAAAAC

    SampleB_1_demux.fastq:
        @Seq7
        AGTGTAAAAAAG

rslt_aux/
    unmatched_1_demux.fastq:
        @Seq8
        CGTGTAAAAAAG
default_program

alias of Demux_Program_Fastx

Demultiplexing by Name

class demux.Demux_Name_Command.Demux_Name_Command(args_)[source]
Given a set of files, each file is assigned a file offset. Each file is then split into separate child files, where each child file holds only sequences belonging to a single sample. These child files are named using the sample name of the sequences they list and the file offset of the file they came from. Demuxing is based on unique sample names contained in sequence names.
Inputs:
  • One or more fasta/fastq files to demux. Sequences in these files should contain as a prefix the sample they came from. (This is untested)
  • A single .barcodes file, listing samples as they appear in sequence names. The barcode sequences themselves can be made up, because this command only makes use of the sample names.
Outputs:
  • <sample_name>_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing all the sequences from file <offset>, which had a sequence name containing sample <sample_name>.
  • unmatched_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing sequences from file <offset>, whose names did not contain any of the sample names listed in the .barcodes file.
Notes:
  • The assignment of an offset to a file should be treated as an arbitrary process and should not be used for record keeping.
  • Each input file will generate its own unmatched_* file (if applicable).
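
One way to picture name-based matching is the sketch below. It assumes, without confirmation from the source, that a sequence is assigned to the longest sample name contained in its sequence name; the function assign_sample is hypothetical.

```python
# Hypothetical illustration; the longest-match rule is an assumption.
def assign_sample(seq_name, sample_names):
    """Return the longest sample name contained in seq_name, or None if none match."""
    matches = [sample for sample in sample_names if sample in seq_name]
    return max(matches, key=len) if matches else None

samples = ["SampleA", "SampleAA", "Sample_B"]
assigned = [assign_sample(n, samples) for n in ["SampleA:001", "SampleAA:002", "SampleA1:003", "Seq8"]]
```

Under this assumption, overlapping names such as SampleA and SampleAA stay distinct, while a name like SampleA1 still falls under SampleA. Choosing sample names that are not substrings of one another avoids the ambiguity altogether.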

Example:

data/
    Data1.fasta:
        @SampleA:001
        AAAAAAAAAAAA
        @SampleAA:002
        AAAAAAAAAAAT
        @SampleA1:003
        AAAAAAAAAAAC
        @Sample_B:001
        AAAAAAAAAAAG

    Data2.fasta:
        @SampleAA:001
        GAAAAAAAAAAA
        @SampleA:002
        TAAAAAAAAAAA
        @Seq8
        CAAAAAAAAAAA
./
    Data.barcodes:
        SampleA         AAA
        SampleAA        AAA
        Sample_B        AAA

$ python chewbacca.py demux_names -i data/ -b Data.barcodes -o rslt

Here, we see that Data1.fasta was assigned ‘0’ as an offset, while Data2.fasta was assigned ‘1’ as an offset. Because both files had sequences from SampleA, the sequences from Data1.fasta were written to SampleA_0_demux.fastq, and those sequences from Data2.fasta were written to SampleA_1_demux.fastq. The same is true for SampleAA. Sample_B appears only in Data1.fasta, so it produces a single file. Note that @SampleA1:003 is written to the SampleA file, because its name contains ‘SampleA’.

rslt/
    SampleA_0_demux.fastq:
        @SampleA:001
        AAAAAAAAAAAA
        @SampleA1:003
        AAAAAAAAAAAC

    SampleAA_0_demux.fastq:
        @SampleAA:002
        AAAAAAAAAAAT

    Sample_B_0_demux.fastq:
        @Sample_B:001
        AAAAAAAAAAAG

    SampleA_1_demux.fastq:
        @SampleA:002
        TAAAAAAAAAAA

    SampleAA_1_demux.fastq:
        @SampleAA:001
        GAAAAAAAAAAA

rslt_aux/
    unmatched_1_demux.fastq:
        @Seq8
        CAAAAAAAAAAA
default_program

alias of Demux_Program_Chewbacca

Sequence Renaming

class rename.Rename_Command.Rename_Command(args_)[source]

Renames sequences in a file with their sampleID and a serial ID#. Useful for simplifying complex naming systems into human-readable sequence names. In order to ensure the correct sample names are preserved, it is recommended that this command be run immediately after the Demux Command.

Inputs:
  • A single fasta/fastq file or a directory containing multiple fasta/fastq files.
Outputs:
  • _renamed.<ext> file - A <fasta/fastq> file with the renamed sequences.
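  • .samples file - A file listing each renamed sequence and the sample it belongs to.
  • .mapping file - A file listing each original sequence name and its new name.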
Notes:
  • In order for the .samples file to correctly list the sample name of the sequences in a file, this command should be run immediately after the Demux Command.
  • The --clip parameter tells Chewbacca that a trailing _<offset number> (from the demuxing command) should not be considered part of the sample name when naming sequences. By default this is set to True, which should be fine in most cases. If you notice parts of your sample names getting clipped off in your .samples file, you should explicitly set this parameter to False.
  • Each input file will have a corresponding .samples, .mapping, and _renamed file.
  • The .samples file is needed by downstream Chewbacca processes (Building the OTU Table).
  • The .mapping file is purely for user convenience and record-keeping.
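
The relationship between the three outputs can be sketched as follows. This is an illustration only: rename_records is a hypothetical helper, the serial IDs are assumed to start at 0 as in the example in this section, and the sample name is assumed to be the input file name without its extension.

```python
import re

# Hypothetical illustration of the renaming scheme; not Chewbacca's own code.
def rename_records(file_stem, names, clip=True):
    """Return (new_names, samples_rows, mapping_rows) for one input file."""
    # With clip=True, a trailing _<offset number> is not part of the sample name.
    sample = re.sub(r"_\d+$", "", file_stem) if clip else file_stem
    new_names = [f"{sample}_ID{i}" for i in range(len(names))]
    samples_rows = [(new, sample) for new in new_names]   # .samples content
    mapping_rows = list(zip(names, new_names))            # .mapping content
    return new_names, samples_rows, mapping_rows

old = ["M03292:26:000000000-AH6AG:1:1101:22127:1256",
       "M03292:26:000000000-AH6AG:1:1101:22127:1257"]
new_names, samples_rows, mapping_rows = rename_records("SampleA_0", old)
```

With clip=False the same call would name sequences SampleA_0_ID0 and SampleA_0_ID1, which shows why clipping matters for sample names that do not end in an offset.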

Example:

SampleA_0.fasta:
    @M03292:26:000000000-AH6AG:1:1101:22127:1256
    AAAA
    @M03292:26:000000000-AH6AG:1:1101:22127:1257
    AAAT

$ python chewbacca.py rename -i SampleA_0.fasta -o rslt

rslt/SampleA_0_renamed.fasta:
    @SampleA_ID0
    AAAA
    @SampleA_ID1
    AAAT

rslt_samples/SampleA_0_renamed.samples:
    SampleA_ID0 SampleA
    SampleA_ID1 SampleA

rslt_aux/SampleA_0_renamed.mapping:
    M03292:26:000000000-AH6AG:1:1101:22127:1256 SampleA_ID0
    M03292:26:000000000-AH6AG:1:1101:22127:1257 SampleA_ID1

Adapter Removal

class clean.Clean_Adapters_Command.Clean_Adapters_Command(args_)[source]

Removes sequencing adapters (and preceding barcodes) from sequences in input file(s). Sequences should be in the following format:

<BARCODE><ADAPTER><SEQUENCE><RC_ADAPTER>.

Valid ADAPTER sequences and their reverse complements (RC_ADAPTER) should be defined separately in a pair of fasta-formatted files. Sequences passed to this command should have already been demultiplexed, as this process will remove the identifying barcode sequences.

Inputs:
  • One or more fasta/fastq files to clean.
  • A single .adapters file
  • A single .adapters_RC file
Outputs:
  • <filename>_debarcoded.<ext> file(s) - <fasta/fastq> files, containing sequences with their leading adapters, trailing adapters, and barcodes removed.
Notes:
  • Be aware of the program-specific details around ‘N’ nucleotide characters.

Example:

Given Data_ID#1 with barcode=AGACGC:

./
    Data.fasta:
        @Data_ID#1
        AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA

    Data.adapters:
        >1
        GGWACWGGWTGAACWGTWTAYCCYCC

    Data.adapters_RC:
        >first
        TGRTTYTTYGGNCAYCCNGARGTNTA

$ python chewbacca.py trim_adapters  -i Data.fasta -o rslt -a Data.adapters -arc Data.adapters_RC

rslt/
    Data_debarcoded.fastq:
        @Data_ID#1
        ATCGATCGATCG
default_program

alias of Clean_Adapters_Program_Flexbar

Quality Cleaning

class clean.Clean_Quality_Command.Clean_Quality_Command(args_)[source]

Removes regions of low quality from fastq-formatted reads. These regions are likely sources of error, and would be detrimental to other analytical processes. Input sequences to this command should have already been demultiplexed, and had their barcodes/adapters removed. Otherwise, partially removing these markers would leave behind fragments of barcodes or adapters that would be difficult to detect, demultiplex, or trim.

Inputs:
  • One or more fastq files to clean.
Outputs:
  • <filename>_cleaned.fastq file(s) - Fastq files, containing sequences with areas of low quality removed.
Notes:
  • Be aware of the program-specific details around ‘N’ nucleotide characters.
  • Be aware of the program-specific defaults for minimum surviving sequence lengths.
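
The windowed quality check behind this command can be sketched in a few lines. This is not the underlying program's algorithm: it assumes Phred+33 quality encoding, and the helper names phred33 and first_low_window are hypothetical. It only locates the first window whose mean quality falls below the threshold.

```python
# Hypothetical illustration; assumes Phred+33 encoded quality strings.
def phred33(char):
    """Decode one quality character into its Phred score."""
    return ord(char) - 33

def first_low_window(quality, window, threshold):
    """Return the start index of the first window with mean quality below threshold."""
    for start in range(len(quality) - window + 1):
        scores = [phred33(c) for c in quality[start:start + window]]
        if sum(scores) / window < threshold:
            return start
    return None  # no window fell below the threshold

# 1 low-quality base, 52 high-quality bases, 3 low-quality bases, 4 high-quality bases.
quality = "!" + "z" * 52 + "%%%" + "zzzz"
cut = first_low_window(quality, window=3, threshold=20)
```

Windows that only partly overlap the low-quality run still average above the threshold here, so the reported position is the start of the run itself.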

Example:

./
    Data.fasta:
        @Data_ID#1
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCTTTACAG
        +
        !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz%%%zzzz

The command below asks Chewbacca to trim away any section of length 3 NT in Data_ID#1 that has quality lower than 20, keeping the longer of the remaining ends. If the remaining sequence at the end of this process is shorter than 15 NT, discard the whole sequence (these values are chosen for illustrative purposes).

$ python chewbacca.py clean_seqs -i Data.fasta -o rslt -m 15 -w 3 -q 20

Note that the ‘TTT’ subsequence has been cut, because its average quality (4, since ‘%’ encodes a quality score of 4 in Phred+33) is less than the threshold (20). After this cut, the longer remaining subsequence (to the left of the cut) was kept, and the shorter subsequence (to the right of the cut) was discarded. Because the final sequence is longer than 15 NT, it is kept and written to the output file.

rslt/
    Data_cleaned.fastq:
        @Data_ID#1
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTC
        +
        !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
default_program

alias of Clean_Quality_Program_Trimmomatic

File Conversion

class util.Convert_Fastq_Fasta_Command.Convert_Fastq_Fasta_Command(args_)[source]

Converts a Fastq-formatted file to a Fasta-formatted file. Useful for reducing data size and preparing for fasta-only operations.

Inputs:
  • A fastq file or a directory containing multiple fastq files.
Outputs:
  • <filename>.fasta file(s) - Converted fasta files.

Example:

./
    Data.fastq:
        @Data_ID#1
        AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA

$ python chewbacca.py convert_fastq_to_fasta -i Data.fastq -o rslt

rslt/
    Data.fasta:
        >Data_ID#1
        AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA

Dereplication

class dereplicate.Dereplicate_Command.Dereplicate_Command(args_)[source]

Dereplicates a fasta file by grouping identical reads together under one representative sequence. The number of duplicate/seed sequences each representative sequence represents is given by a ‘replication count’ at the end of the sequence name in the output fasta file. If a .groups file is provided, then previous replication counts will be taken into account. For example, imagine a representative sequence X that represents 3 sequences. If X is found to be identical to Y (which is not a seed for any other sequence), then the new cardinality, or replication count, of X becomes 4. Cardinalities are denoted with a suffix of ‘_K’ on the sequence name, where K is the cardinality of the group that sequence represents.

Inputs:
  • One or more fasta files to dereplicate.
  • Optional: .groups - A list of representative names and the names of their seed sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.
Outputs:
  • _counts.fasta file - A fasta file with unique sequences and their replication counts.
  • _derep.groups file - A list of representative names and the names of their seed sequences.
Notes:
  • This command only dereplicates within each fasta file (not across all files). This means a sequence in one file will be unique within that file, but might exist in another file. To ensure sequences are unique across an entire dataset, merge all fasta files into one file, then dereplicate that fasta file. If the fasta files each have groups files, then make sure you merge those as well.
  • Each input file will generate a corresponding _counts.fasta file.
  • If an input .groups file is not provided, then each input fasta file will generate a new groups file named <file_name>_derep.groups. If an input .groups file IS provided, then a single groups file named ‘dereplicated_updated.groups’ will be generated.
  • The output .groups file is needed by downstream Chewbacca processes (Dereplication, Clustering, Building the OTU Table).
  • The order of sequence names in the *_counts.fasta and .groups file is arbitrary.
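
How replication counts combine with an existing .groups file can be sketched as below. This is an illustration rather than the underlying program's method: dereplicate is a hypothetical helper, records are (name, sequence) tuples, and sequence names are assumed to have had any ‘_K’ suffix stripped already.

```python
# Hypothetical illustration of count bookkeeping; not the underlying program.
def dereplicate(records, groups=None):
    """Group identical sequences; return (counts, members) keyed by representative."""
    groups = groups or {}
    rep_for_seq = {}   # sequence string -> name of its representative
    members = {}       # representative name -> names it stands for
    for name, seq in records:
        rep = rep_for_seq.setdefault(seq, name)
        bucket = members.setdefault(rep, [])
        # A name with earlier replication data brings its whole previous group along.
        for member in groups.get(name, [name]):
            if member not in bucket:
                bucket.append(member)
    counts = {rep: len(names) for rep, names in members.items()}
    return counts, members

records = [("seq1", "AAA"), ("seq2", "AAA"), ("seq3", "AAAG"),
           ("seq4", "AAAGT"), ("seq7", "AAAGT")]
counts, members = dereplicate(records, {"seq4": ["seq4", "seq5", "seq6"]})
```

In this run seq4 already stood for three sequences, so absorbing the identical seq7 raises its count to 4.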

Example:

./
    Data.fasta
        >seq1
        AAA
        >seq2
        AAA
        >seq3
        AAAG
        >seq4_3
        AAAGT
        >seq7
        AAAGT

    test.groups
        seq4    seq4 seq5 seq6

In the above example, test.groups indicates that seq4 is a sequence that has previously been identified as a representative (in some earlier round of clustering or dereplication).

$ python chewbacca.py dereplicate_fasta -i Data.fasta -o rslt -g test.groups

rslt/Data_counts.fasta:
    >seq4_4
    AAAGT
    >seq1_2
    AAA
    >seq3_1
    AAAG

rslt_groups_files/*.groups:
    seq3        seq3
    seq1        seq2 seq1
    seq4        seq7 seq6 seq5 seq4

Notice that Data_counts.fasta lists the unique sequences from Data.fasta, and their replication counts. Also notice that seq4 had previous replication data (stored in the test.groups file).

default_program

alias of Dereplicate_Program_Vsearch

File Splitting

class util.Partition_Command.Partition_Command(args_)[source]

A utility command that partitions a fasta/fastq file into a set of files (of the same file format), with a user-specified (maximum) number of sequences per file. Allows users to partition a large file into segments, and perform discrete operations in run_parallel over those segments.

Inputs:
  • One or more fasta/fastq files to partition.
  • C: An integer defining the maximum number of sequences per file
Outputs:
  • <filename>_part_<part_#>.<ext> file(s) - <fasta/fastq> files, with at most C sequences per file.
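
The partitioning rule itself is simple chunking, sketched below. The function partition is a hypothetical helper operating on an in-memory list of records; it does not read or write files or produce the output file names.

```python
# Hypothetical illustration of splitting records into parts of at most C entries.
def partition(records, max_per_file):
    """Split records into consecutive chunks holding at most max_per_file records."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be a positive integer")
    return [records[i:i + max_per_file] for i in range(0, len(records), max_per_file)]

parts = partition(["Data_ID1", "Data_ID2", "Data_ID3"], 2)
```

Only the last chunk can be smaller than C, which is why C is described as a maximum.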

Example:

./
    Data.fq:
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID2
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID3
        GATTTGGGG
        +
        !zzzzzzzzz

$ python chewbacca.py convert_fastq_to_fasta -i Data.fq -o rslt/

rslt/
    Data.fasta:
        @Data_ID1
        GATTTGGGG
        @Data_ID2
        GATTTGGGG
        @Data_ID3
        GATTTGGGG

File Merging

class util.Merge_Command.Merge_Command(args_)[source]

Concatenates multiple files into a single file. Useful for combining the results of a run_parallel operation, or when preparing for cross-sample dereplication.

Inputs:
  • A set of files to merge.
  • An <output_filename>.
  • An <output_prefix>.
Outputs:
  • <output_filename>.<output_prefix> - A file consisting of all the input files concatenated together.
Notes:
  • The order of the content in the concatenated files is not guaranteed.

Example:

targets/
    Data.fq:
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz

    Data2.fa:
        @Data_ID1
        GATTTGGGG

    Blah.txt
        Hello World!

$ python chewbacca.py merge_files -i targets/ -o rslt/ -f txt -n Merged

rslt/
    Merged.txt:
        Hello World!
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID1
        GATTTGGGG

File Cleaning

class util.Ungap_Command.Ungap_Command(args_)[source]

Removes target characters from a fasta/fastq file. Useful for removing gap characters from sequence alignments.

Inputs:
  • One or more fasta/fastq files to clean.
  • A string of one or more gap characters to remove.
Outputs:
  • *_cleaned.<ext> file - A <fasta/fastq> file with gap characters removed from its sequences.
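
Removing a set of gap characters from a sequence can be sketched with Python's built-in string translation. The helper name ungap is hypothetical, and the sketch handles a single sequence string rather than a whole file.

```python
# Hypothetical illustration; operates on one sequence string.
def ungap(sequence, gap_chars):
    """Return sequence with every character listed in gap_chars removed."""
    return sequence.translate(str.maketrans("", "", gap_chars))

cleaned = ungap("AAAAA.A*A-A", ".*-")
```

Each character in the gap string is treated literally, so characters such as ‘.’ and ‘*’ need no escaping.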

Example:

Data.fasta:
    >seq1
    AAAAA.A*A-A

$ python chewbacca.py ungap_fasta -i Data.fasta -o rslt -f fasta -g ".*-"

rslt/Data_cleaned.fasta:
    >seq1
    AAAAAAAA

Deep Cleaning

class clean.Clean_Deep_Command.Clean_Deep_Command(args_)[source]

Performs an intensive deep-cleaning of sequences to eliminate frameshifts, detect chimeras, and determine sequence orientation. Input files to this command should first be dereplicated. Doing so will reduce the total number of alignments required, and reduce computation time.

Inputs:
  • One or more fasta/fastq files to deep clean (nucleotide sequences).
  • One reference fasta (nucleotide sequences).
Outputs:
  • *_AA - Amino Acid Alignment file, including reference sequences.
  • *_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
  • *_NT - Nucleotide Alignment file, including reference sequences.
Notes:
  • Sequences that do not meet quality cleaning standards are dropped.
  • The output files contain reference sequences, and odd alignment characters. Both of these need to be removed by running the Clean_Deep_Repair Command.

Example:

Data.fasta
BIOCODE.fa

$ python chewbacca.py macseAlign -i Data.fasta -o rslt -d BIOCODE.fa

rslt/Data_AA
rslt/Data_NT
rslt/Data_log.csv
default_program

alias of Clean_Deep_Program_Macse

Deep Cleaning Repair

class clean.Clean_Deep_Repair_Command.Clean_Deep_Repair_Command(args_)[source]
Cleans aligned files by removing gap characters and reference sequences from the file. Sequences passed to this
command should have previously been aligned.
Inputs:
  • *_AA - Amino Acid Alignment file, including reference sequences.
  • *_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
  • *_NT - Nucleotide Alignment file, including reference sequences.
  • The original fasta files that were passed to the Clean_Deep Command.
  • The nucleotide reference fasta that was passed to the Clean_Deep Command.
Outputs:
  • *_MERGED.fasta - A clean fasta file with all the surviving sequences from deep cleaning.
Notes:
  • A single *_MERGED.fasta is generated regardless of the number of input files.

Example:

BIOCODE.fa

originalData/Data.fasta

input/
    Data_AA
    Data_NT
    Data_log.csv

$ python chewbacca.py -i input/ -o out/ -d  BIOCODE.fa -s originalData/

out/
    MACSE_OUT_MERGED.fasta
default_program

alias of Clean_Deep_Repair_Program_Macse

Sequence Clustering

class cluster.Cluster_Command.Cluster_Command(args_)[source]

Clusters a set of fasta files. This command generates a fasta file of unique sequences (each representing a cluster) and a .groups file. This command also takes an optional .groups file containing replication data from previous commands. If a .groups file is supplied, only one output .groups file is generated (regardless of the number of inputs).

Inputs:
  • One or more fasta files to cluster.
  • Optional: .groups - A list of representative names and the names of their seed sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.
Outputs:
  • *.fasta file - A fasta file with unique sequences and their replication counts.
  • *.groups - A list of representative names and the names of their seed sequences.
Notes:
  • The input fasta file(s) should have been dereplicated before clustering.
  • For a single experiment with multiple fasta files, it is best to merge all input fasta files, dereplicate them, then cluster the single merged and dereplicated fasta file. This provides the best OTU groupings.

Example:

./
    Data.fasta:
        >seq1_3
        AAAAAAAAAA
        >seq2_1
        ATAAAAAAAA
        >seq3_1
        TTTTTTTTTT
        >seq4_1
        TTTTTTATTT
        >seq5_1
        TTTTTTATCT


    Data.groups:
        seq1    seq6 seq1 seq7

$ python chewbacca.py cluster_seqs -i Data.fasta -o rslt -g Data.groups

rslt/
    Data_clustered_seeds.fasta:
        >seq1_4
        AAAAAAAAAA
        >seq3_3
        TTTTTTTTTT

rslt_groups_files/
    postcluster_updated.groups:
        seq3    seq3 seq5 seq4
        seq1    seq2 seq1 seq7 seq6

OTU Table Construction

class otu.Build_OTU_Table_Command.Build_OTU_Table_Command(args_)[source]

Builds an OTU table using a .groups, .samples, and .barcodes file. The OTU table shows OTU (group) abundance by sample.

Inputs:
  • One or more .samples files.
  • One or more .barcodes files.
  • One or more .groups files.
Outputs:
  • matrix.txt - A tab-delimited table mapping OTUs (groups) to their abundance in each sample.
Notes:
  • A sequence name may not appear in more than one groups file (or on more than one line of a groups file, for that matter!).
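
The counting behind the OTU table can be sketched as follows. This is an illustration, not Chewbacca's implementation: build_otu_table is a hypothetical helper that takes already-parsed inputs, namely a dict of group members, a dict mapping each sequence to its sample, and the sample order taken from the .barcodes file.

```python
from collections import Counter

# Hypothetical illustration of OTU abundance counting on pre-parsed inputs.
def build_otu_table(groups, samples, sample_order):
    """Return {otu: [abundance in each sample, following sample_order]}."""
    table = {}
    for otu, members in groups.items():
        # Each member sequence contributes one count to the sample it came from.
        per_sample = Counter(samples[member] for member in members)
        table[otu] = [per_sample.get(sample, 0) for sample in sample_order]
    return table

groups = {"seq3": ["seq3", "seq5", "seq4"], "seq1": ["seq2", "seq1", "seq7", "seq6"]}
samples = {"seq1": "Sample1", "seq2": "Sample1", "seq3": "Sample1", "seq4": "Sample2",
           "seq5": "Sample2", "seq6": "Sample2", "seq7": "Sample3"}
table = build_otu_table(groups, samples, ["Sample1", "Sample2", "Sample3", "Sample4"])
```

A sample with no sequences in any group, such as Sample4 here, still gets a column filled with zeros because it is listed in the sample order.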

Example:

./
    test.barcodes
        Sample1 aaaaaa
        Sample2 aaaaat
        Sample3 aaaaac
        Sample4 aaaaag

    test.groups
        seq3    seq3 seq5 seq4
        seq1    seq2 seq1 seq7 seq6

    test.samples
        seq1    Sample1
        seq2    Sample1
        seq3    Sample1
        seq4    Sample2
        seq5    Sample2
        seq6    Sample2
        seq7    Sample3

$ python chewbacca.py build_matrix -b test.barcodes -g  test.groups -s test.samples -o rslt/

rslt/
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        seq1    2       1       1       0

OTU Identification

class otu.Query_OTU_DB_Command.Query_OTU_DB_Command(args_)[source]

Aligns sequences in a fasta file against those in a reference database in order to determine OTU identity.

Only alignment-based identification using vsearch is currently available.

Inputs:
  • One or more fasta files containing sequences to identify.
  • A curated fasta file of high quality sequences and known species.
  • A database containing taxonomic identifiers for sequences in the curated fasta file.
Outputs:
  • A .tax file.
Notes:
  • The files COI.fasta and ncbi.db are included in the Chewbacca Docker distributions.

Example:

~/ARMS/refs/

    COI.fasta # A precompiled fasta file of COI data from NCBI.
        >94483305
        AGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATATTGATAATTGTTGTGATGAAATT
        GATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTGGGAGT

    ncbi.db # A precompiled database of (Taxa) for the entries in 'COI.fasta'.


data/
    Data.fasta:
        >seq1
        GAATAGGTGTTGGTATAGAATGGGGTCTCCTCCTCCGGCGGGGTCGAAGAAGGTGGTGTTGAGGTTGCGGTCTGTTAGTAGTATAGTGATGCCAGCAG
        CTAGGACTGGGAGAGATAGGAGAAGTAGGACTGCTGTGATTAGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATA
        TTGATAATTGTTGTGAGGAAATTGATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTG
        GGAGTAGTTCCCTGCTAA

$ python chewbacca.py query_db -i Data.fasta -o rslt -r ~/ARMS/refs/COI.fasta -d ~/ARMS/refs/ncbi.db

rslt/
    Data_result.out
        seq1    94483305        99.4    173     55.4    Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens
class otu.Query_OTU_Fasta_Command.Query_OTU_Fasta_Command(args_)[source]

Aligns sequences in a fasta file against those in a reference fasta in order to determine OTU identity.

Inputs:
  • One or more fasta files containing sequences to identify.
  • A curated fasta file of high quality sequences and known species.
  • A two-column, tab-delimited text file mapping sequence names in the curated fasta file to taxonomic identifiers.
Outputs:
  • A .tax file.
Notes:
  • The files ‘bold.fna’ and ‘seq_lin.mapping’ are included in the Chewbacca Docker distributions.

Example:

~/ARMS/data/
    bold.fna # A precompiled fasta file of data from BOLD.
        >GBMAA1117-14
        GGGCTTTTGCGGGTATGATAGGAACAGCATTTAGTATGCTTATTAGGTTAGAACTATCTTCCCCAGGGTCTATGTTAGGAGATGATCATTTATATAAT
        GTTATAGTAACAGCTCATGCATTTGTAATGATATTTTTTTTAGTTATGCCAGTAATGATTGGGGGTTTTGGTAATTGGTTAGTACCTTTATATATTGG
        TGCCCCGGATATGGCTTTTCCTAGATTAAATAATATTAGTTTTTGGTTATTACCTCCGGCGCTTACTTTATTATTAGGTTCGGCTTTTGTAGAACAAG
        GGGCTGGGACAGGTTGGACAGTTTATCCGCCTTTATTTAGTATTCAAACTCATTCTGGGGGGTCTGTGGATATGGTAATATTTAGTTTACATTTAGCT
        GGAATATCTTCTATATTAGGGGCTATGAATTTTATAACAACAATCTTTAATATGAGGTCTCCGGGAGTAACTATGGATAGAATGCCTTTATTTGTTTG
        ATCTGTTTTAGTAACTGCTTTTTTATTATTATTATCATTGCCAGTATTAGCTGGTGCCATAACAAGTCTTTTAACCGATCGAGATTTTAATACTACAT
        TT

    seq_lin.mapping # A precompiled two-column tab file of (Taxa) for the entries in 'bold.fna'.
        GBMAA1117-14    Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa

./
    Data.fasta:
        >seq1
        ACTATCAGGCATTCAAGCCCATTCAGGGGGAGCAGTAGATATGGCTATATTTAGTCTACATCTAGCTGGTGTATCCTCTATTTTAAGTTCTATAAACT
        TTATAACTACTATAATTAATATGAGGGTTCCTGGGATGAGTATGCATAGATTACCTCTATTCGTATGGTCTGTATTAGTTACTACAATATTATTGTTG
        TTATCTTTACCAGTATTAGCTGGTGGAATTACAATGTTATTGACAGATAGAAATTTTAATACAACATTCTTTGACCCTGCGGGAGGAGGAGATCCTAT
        TTTATTCCAGCACTTATTT

$ python chewbacca.py query_fasta -i Data.fasta -o rslt -r ~/ARMS/data/bold.fna -x ~/ARMS/data/seq_lin.mapping

rslt/
        Data_result.out
            seq1        GBMAA1117-14    90.6    265     84.7 Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa

OTU Annotation

class otu.Annotate_OTU_Table_Command.Annotate_OTU_Table_Command(args_)[source]

Annotates an OTU table with taxonomic names by replacing sequence names in the OTU table with their identified taxonomies. Multiple OTUs can be annotated with the same taxonomic name; those are not combined.

Inputs:
  • An OTU_table to annotate.
  • One or more .tax files to read annotations from.
Outputs:
  • An OTU_table with sequence names replaced by taxonomic names in the input .tax file.
Notes:
  • The input annotation file(s) should list only one identification per sequence name. If you find more than one taxonomic identity for a sequence, choose only one to include in the input .tax file(s).
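
The replacement step can be sketched as below. It assumes, based on the sample output in this section rather than on a documented format, that .tax lines are tab-delimited with the sequence name in the first column and the taxonomy in the last; parse_tax and annotate_rows are hypothetical helpers.

```python
# Hypothetical illustration; the .tax column layout is an assumption.
def parse_tax(lines):
    """Map each sequence name (first column) to its taxonomy (last column)."""
    fields = (line.rstrip("\n").split("\t") for line in lines)
    return {row[0]: row[-1] for row in fields if len(row) > 1}

def annotate_rows(rows, taxonomy):
    """Replace the leading sequence name of each row when a taxonomy is known."""
    return [[taxonomy.get(name, name)] + counts for name, *counts in rows]

tax = parse_tax(["seq1\t94483305\t99.4\t173\t55.4\tChordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens\n"])
annotated = annotate_rows([["seq3", 1, 2, 0, 0], ["seq1", 2, 1, 1, 0]], tax)
```

Rows without an identification, such as seq3 here, keep their original sequence name.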

Example:

./
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        seq1    2       1       1       0

    data.tax:
        seq1    94483305        99.4    173     55.4    Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens

$ python chewbacca.py annotate_matrix -i matrix.txt -a data.tax -o rslt

rslt/
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens  2       1       1       0