Source code for dereplicate.Dereplicate_Command

from classes.ChewbaccaCommand import ChewbaccaCommand
from Dereplicate_Program_Vsearch import Dereplicate_Program_Vsearch


class Dereplicate_Command(ChewbaccaCommand):
    """Dereplicates a fasta file by grouping identical reads together under one representative sequence.  The
    number of duplicate/seed sequences that each representative sequence represents is given by a 'replication
    count' at the end of the sequence name in the output fasta file.  If a .groups file is provided, then previous
    replication counts are taken into account.  For example, imagine a representative sequence X that represents
    3 sequences.  If X is found to be identical to Y (which is not a seed for any other sequence), then the new
    cardinality, or replication count, of X becomes 4.  Cardinalities are denoted with a suffix of '_K' on the
    sequence name, where K is the cardinality of the group that sequence represents.


    **Inputs**:
        * One or more fasta files to dereplicate.
        * Optional: :ref:`.groups` - A list of representative names and the names of their seed \
                                            sequences.  You likely have one of these files if you've previously run a \
                                            clustering or dereplication command.

    **Outputs**:
        * _counts.fasta file - A fasta file with unique sequences and their replication counts.
        * _derep.groups file (:ref:`.groups`) - A list of representative names and the names of their seed \
                                            sequences.

    **Notes**:
        * This command only dereplicates within each fasta file (not across all files). \
            This means a sequence in one file will be unique within that file, but might exist in another file. \
            To ensure sequences are unique across an entire dataset, merge all fasta files into one file, then \
            dereplicate that fasta file. If the fasta files each have groups files, then make sure you merge those as well.


        * Each input file will generate a corresponding _counts.fasta file.
        * If an input .groups file is not provided, then each input fasta file will generate a new groups file named \
            <file_name>_derep.groups.  If an input .groups file IS provided, then a single groups file named \
            'dereplicated_updated.groups' will be generated.
        * The output .groups file is needed by downstream Chewbacca processes (Dereplication, Clustering,\
                Building the OTU Table).
        * The order of sequence names in the \*_counts.fasta and .groups file is arbitrary.

    **Example**:

    ::

        ./
            Data.fasta
                >seq1
                AAA
                >seq2
                AAA
                >seq3
                AAAG
                >seq4_3
                AAAGT
                >seq7
                AAAGT

            test.groups
                seq4	seq4 seq5 seq6

    In the above example, test.groups indicates that seq4 is a sequence that has previously been identified as a
    representative (in some earlier round of clustering or dereplication).

    ``$ python chewbacca.py dereplicate_fasta -i Data.fasta -o rslt -g test.groups``

    ::

        rslt/Data_counts.fasta:
            >seq4_4
            AAAGT
            >seq1_2
            AAA
            >seq3_1
            AAAG

        rslt_groups_files/*.groups:
            seq3	seq3
            seq1	seq2 seq1
            seq4	seq7 seq6 seq5 seq4

    Notice that Data_counts.fasta lists the unique sequences from Data.fasta, and their replication counts.  Also
    notice that seq4 had previous replication data (stored in the test.groups file).
    """
    supported_programs = [Dereplicate_Program_Vsearch]
    default_program = Dereplicate_Program_Vsearch
    command_name = "Dereplicate"
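
The bookkeeping described in the docstring above (replication counts carried in a `_K` suffix, and seed names carried over from a .groups file) can be illustrated with a small pure-Python sketch. This is not the code Chewbacca or Vsearch actually runs. The helper names `split_count` and `dereplicate` are hypothetical, the choice of the first-seen read as representative is an assumption, and the suffix parsing assumes that any trailing `_<digits>` on a name is a replication count.

```python
def split_count(name):
    """Split 'seq4_3' into ('seq4', 3).

    Assumption: a trailing '_<digits>' is always a replication count.
    Names without such a suffix are treated as representing one read.
    """
    base, sep, suffix = name.rpartition("_")
    if sep and suffix.isdigit():
        return base, int(suffix)
    return name, 1


def dereplicate(records, groups=None):
    """Group identical sequences under one representative.

    records: list of (name, sequence) pairs, in file order.
    groups:  optional dict mapping a representative name to its seed names
             (the content of a .groups file).
    Returns (counts, new_groups), where counts is a list of
    ('<rep>_<K>', sequence) pairs and new_groups maps each representative
    to the list of seed names it now covers.
    """
    groups = groups or {}
    rep_by_seq = {}      # sequence -> representative name (first seen; an assumption)
    count_by_seq = {}    # sequence -> summed replication count
    members_by_seq = {}  # sequence -> seed names gathered so far

    for name, seq in records:
        base, count = split_count(name)
        # Reuse earlier seed names if this read was already a representative.
        seeds = list(groups.get(base, [base]))
        if seq not in rep_by_seq:
            rep_by_seq[seq] = base
            count_by_seq[seq] = 0
            members_by_seq[seq] = []
        count_by_seq[seq] += count
        members_by_seq[seq].extend(seeds)

    counts = [("{}_{}".format(rep_by_seq[s], count_by_seq[s]), s) for s in rep_by_seq]
    new_groups = {rep_by_seq[s]: members_by_seq[s] for s in rep_by_seq}
    return counts, new_groups
```

Feeding in the five records from the docstring example together with `{"seq4": ["seq4", "seq5", "seq6"]}` yields the same representatives and counts as the documented output (`seq1_2`, `seq3_1`, `seq4_4`), although in file order rather than the arbitrary order the real command produces.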