To get the most out of this training, each student should complete the following before the first class:
- Confirm that they can connect to their VM.
- Watch a 75-minute video on the Linux operating system.
- Watch a short video on tmux, a Linux application that allows for multiple terminals.
Connect to your VM
Your VM IP address and an SSH key file will be listed in the Slack channel. This information will be posted on Monday, June 29th. To connect to the VM, you will need the terminal program MobaXterm, which can be downloaded from https://mobaxterm.mobatek.net/download-home-edition.html. IMPORTANT: Please download the “Portable edition”, not the “Installer edition”. The “Portable edition” runs without being installed, so no admin rights are needed. Once you have downloaded MobaXterm and have your IP address and SSH key, watch the YouTube video on How to Connect to Your VM. I will also post the username and passphrase in Slack.
The video link is https://www.youtube.com/watch?v=oxuRxtrO2Ag. This video will give you a basic understanding of navigating around the bash terminal in Linux. It is well worth spending the 75 minutes before the course, following along within your own VM.
Many of the processes we will be running will require long processing times that could be interrupted by network connection issues. To avoid this, we will perform the exercises in the class using the tmux application. This tmux tutorial video will guide you through the process of using tmux.
Some additional resources that you will find useful when starting out are:
Session 1 - Background on Viral Genomics and Coronaviruses - Tuesday, Sept 8th, 12 PM and 3 PM EDT
Session 2 - Sequencing Methods for SARS-CoV-2 - Wednesday, Sept 9th, 11 AM and 2 PM EDT
Session 3 - Linux - Friday, Sept 11th, 11 AM and 2 PM EDT
Session 4 - StaPH-B Toolkit and Cecret - Monday, Sept 14th, 11 AM and 2 PM EDT
Session 5 - Terra.bio - Wednesday, Sept 16th, 11 AM and 2 PM EDT
Session 6 - Terra.bio - Friday, Sept 18th, 11 AM and 2 PM EDT
Session 7 - NGS Data Visualization for QC of Results - Monday, Sept 21st, 11 AM and 2 PM EDT
Session 8 - Data Sharing with GISAID and NCBI SRA - Wednesday, Sept 23rd, 11 AM and 2 PM EDT
Session 9 - Data Sharing with NCBI Genbank - Friday, Sept 25th, 11 AM and 2 PM EDT
Office hours will be offered each week on Tuesday from 2 PM to 4 PM EDT and on Thursday from 12 PM to 2 PM EDT, and by request.
Session 1 - Background on Viral Genomics and Coronaviruses
- Register for a GISAID account
- Viral genomics primer
- Considerations of bacteria vs viral pathogens
- Coronavirus, the new flu.
- Papers on Coronaviruses
Session 2 - Sequencing Methods for SARS-CoV-2
- Metagenomics, enrichment, amplicon
- The ARTIC protocol
- ARTIC + Illumina DNA Flex - Part 1 at protocols.io.
- ARTIC + Illumina DNA Flex - Part 2 at protocols.io.
Session 3 - Linux
- Linux basics
- Working in Tmux
- Connecting to Basespace
- Connecting to cloud resources
- Transferring data to your VM for the class
Commands used in this session
***bash commands***
dir                                        #lists out a directory
ls                                         #also lists out a directory
ls -la                                     #lists out a directory with details
ll                                         #alias for ls -la
cd <directory>                             #change directory
cd                                         #will bring you to your home directory
mkdir <dir>                                #make directory
rmdir <dir>                                #remove directory if it is empty
rm -rf <dir>                               #remove directory and everything in it (cannot be undone)
gzip <name.fastq>                          #compress a read file

***tmux***
tmux ls                                    #list open tmux sessions
tmux new -s <name>                         #start a tmux session called <name>
tmux a -t <name>                           #attach to tmux session called <name>
tmux kill-session -t <name>                #kill tmux session called <name>
Ctrl-b ,                                   #rename current window
Ctrl-b c                                   #create new window
Ctrl-b n                                   #go to next window
tmux info                                  #show tmux server information (Ctrl-b ? lists the key bindings)

***basespace cli***
bs auth                                    #authenticate to Basespace
bs list projects                           #list projects in Basespace
bs download project -n <name> -o <dir>     #download project files to <dir>

***gsutil***
gsutil ls                                  #list storage buckets
gsutil cp <source> <destination>           #copy data to or from a storage bucket

***sra-toolkit*** - might need to "sudo apt-get install sra-toolkit"
prefetch <sra_id>
fastq-dump --split-files --gzip <sra_id>
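To practice the bash commands above without touching real data, the short sketch below can be saved to a file and run with sh. It works entirely inside a throwaway directory. The directory name `reads`, the file name `sample.fastq`, and the single made-up read are placeholders for illustration, not class data.

```shell
#!/bin/sh
# Practice run of the basic bash commands in a throwaway directory.
set -eu

start_dir=$PWD
scratch=$(mktemp -d)            # temporary practice directory (placeholder location)
cd "$scratch"

mkdir reads                     # make a directory
cd reads                        # move into it

# Write one made-up FASTQ record so there is something to compress.
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq

gzip sample.fastq               # replaces sample.fastq with sample.fastq.gz
ls -la                          # list the directory with details

cd ..                           # back up one level
rm -rf reads                    # remove the directory and everything in it

# Count what is left in the practice directory.
remaining=$(ls -A | wc -l | tr -d ' ')
echo "entries left: $remaining"

cd "$start_dir"
rmdir "$scratch"                # rmdir succeeds because the directory is now empty
```

Note the difference between the last two removals: `rm -rf` deletes a directory together with its contents, while `rmdir` only works once the directory is empty.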
Session 4 - StaPH-B Toolkit - Monroe
- Monroe pipeline
Some commands to make today’s session easier
***Configure Cecret***
tmux new -s session4                                  #starts a new tmux session if you don't have one already
cd                                                    #moves you to your home directory
rm -rf Cecret/                                        #removes old Cecret installation
git clone https://github.com/UPHL-BioNGS/Cecret.git   #clones new Cecret
cd <workspace>                                        #move to workspace directory

***let's copy over some files***
gsutil ls
gsutil ls gs://mtn-class-bucket/
gsutil -m cp gs://mtn-class-bucket/cecret.tar .
gsutil -m cp gs://mtn-class-bucket/monroe.tar .
gsutil -m cp gs://mtn-class-bucket/Cecret.nf .

***move the Cecret.nf file to the Cecret location***
mv Cecret.nf ~/Cecret/

***extract tar files***
tar -xvf cecret.tar
tar -xvf monroe.tar
mkdir cecret_live
mkdir monroe_live
cd monroe
cp 20-07-01_* ../monroe_live/
cp -r reads/ ../monroe_live/                          #you can use optional SRA method at the end to copy into a reads folder
cd ..
cd monroe_live
staphb-wf monroe pe_assembly --primers V3 --output pe_assembly_1 --config 20-07-01_pe_assembly.config reads

***Before we launch cluster_analysis, we need to go into the monroe_live/pe_assembly_1/assemblies/ directory
***and remove/modify the failed assemblies. This will be discussed in the video. Failure to do so will
***create incorrect results for the cluster_analysis step.
***example data screening***
mv SRR12542859_consensus.fasta SRR12542859_consensus.fasta.fail
mv SRR12542860_consensus.fasta SRR12542860_consensus.fasta.fail
mv SRR12542861_consensus.fasta SRR12542861_consensus.fasta.fail
mv SRR12542863_consensus.fasta SRR12542863_consensus.fasta.fail
mv SRR12542869_consensus.fasta SRR12542869_consensus.fasta.fail
***end data screening***
staphb-wf monroe cluster_analysis --output cluster_analysis_1 --config 20-07-01_cluster_analysis.config pe_assembly_1/assemblies/

***If the data wasn't already copied over from the storage bucket, the
***following would download the data needed from SRA
***retrieving data from SRA***
prefetch SRR12542859 SRR12542860 SRR12542861 SRR12542862 SRR12542863 SRR12542864 SRR12542865 SRR12542866 SRR12542867 SRR12542868 SRR12542869 SRR12542870
fastq-dump --split-files --gzip SRR12542859 SRR12542860 SRR12542861 SRR12542862 SRR12542863 SRR12542864 SRR12542865 SRR12542866 SRR12542867 SRR12542868 SRR12542869 SRR12542870
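The data screening step above renames each failed assembly by hand. When the list of failures is longer, a small shell function can do the same renaming in a loop. The sketch below is one possible way to write it. The function name `mark_failed` is hypothetical, and the demo runs on empty placeholder files in a temporary directory rather than on real assemblies.

```shell
# mark_failed DIR ID [ID...]
# Renames DIR/<ID>_consensus.fasta to DIR/<ID>_consensus.fasta.fail for each ID.
# (mark_failed is a made-up helper name for this sketch.)
mark_failed() {
    dir=$1
    shift
    for id in "$@"; do
        f="$dir/${id}_consensus.fasta"
        if [ -f "$f" ]; then
            mv "$f" "$f.fail"
        else
            # Report missing files instead of stopping the whole loop.
            echo "skipping $id: $f not found" >&2
        fi
    done
}

# Demo on empty placeholder files in a temporary directory.
demo=$(mktemp -d)
for id in SRR12542859 SRR12542860 SRR12542862; do
    : > "$demo/${id}_consensus.fasta"
done

mark_failed "$demo" SRR12542859 SRR12542860
ls "$demo"    # SRR12542862 keeps its .fasta name; the other two now end in .fail
```

In class, the equivalent call from inside monroe_live would be `mark_failed pe_assembly_1/assemblies SRR12542859 SRR12542860 SRR12542861 SRR12542863 SRR12542869`. Which assemblies count as failed is still a judgment made by inspecting the results, as discussed in the video.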
Session 5 - Terra.bio Part 1
Session 6 - Terra.bio Part 2
Instructions for working with WDL files on the command line.
# install miniwdl (maybe about 1 minute)
pip install miniwdl

# test miniwdl (also about 1 minute)
miniwdl run_self_test

# describe refbased assembly parameters
miniwdl run https://raw.githubusercontent.com/broadinstitute/viral-pipelines/master/pipes/WDL/workflows/assemble_refbased.wdl

# go get some input files
gsutil -m cp gs://pathogen-public-dbs/refs/ARTIC_V3_nCoV-2019_NC_045512_primers3.bed .
gsutil -m cp gs://pathogen-public-dbs/refs/ref-sarscov2-NC_045512.2.fasta .

# fetch a bam from SRA (this takes about 1 minute or less)
miniwdl run https://raw.githubusercontent.com/broadinstitute/viral-pipelines/master/pipes/WDL/workflows/fetch_sra_to_bam.wdl Fetch_SRA_to_BAM.SRA_ID=SRR12542859

# assemble that sample (this takes 7 mins)
miniwdl run https://raw.githubusercontent.com/broadinstitute/viral-pipelines/master/pipes/WDL/workflows/assemble_refbased.wdl reads_unmapped_bams=/home/mtn-region/20200916_162418_fetch_sra_to_bam/out/reads_ubam/SRR12542859.bam reference_fasta=ref-sarscov2-NC_045512.2.fasta trim_coords_bed=ARTIC_V3_nCoV-2019_NC_045512_primers3.bed

# Optional: querying json files with jq (instead of searching the above by eye)
sudo apt-get install jq
jq -r '.["assemble_refbased.align_to_ref_merged_reads_aligned"]' _LAST/outputs.json
# returns 18914
jq -r '.["assemble_refbased.assembly_length_unambiguous"]' _LAST/outputs.json
# returns 14031
jq -r '.["assemble_refbased.assembly_mean_coverage"]' _LAST/outputs.json
# returns 57.09861886767214
jq -r '.["assemble_refbased.dist_to_ref_snps"]' _LAST/outputs.json
# returns 12
jq -r '.["assemble_refbased.dist_to_ref_indels"]' _LAST/outputs.json
# returns 0
# all of the above values are identical to what Terra reports!
# serially run it on all 12 samples (total runtime is about 2 hours on a 1-core VM)
for srr in SRR12542859 SRR12542860 SRR12542861 SRR12542862 SRR12542863 SRR12542864 SRR12542865 SRR12542866 SRR12542867 SRR12542868 SRR12542869 SRR12542870; do
    BAMFILE=$(miniwdl run https://raw.githubusercontent.com/broadinstitute/viral-pipelines/master/pipes/WDL/workflows/fetch_sra_to_bam.wdl Fetch_SRA_to_BAM.SRA_ID=$srr | jq -r '.outputs["fetch_sra_to_bam.reads_ubam"]')
    miniwdl run https://raw.githubusercontent.com/broadinstitute/viral-pipelines/master/pipes/WDL/workflows/assemble_refbased.wdl reads_unmapped_bams=$BAMFILE reference_fasta=ref-sarscov2-NC_045512.2.fasta trim_coords_bed=ARTIC_V3_nCoV-2019_NC_045512_primers3.bed
done

## other notes about parallelization
# Times listed above are for a 1-core VM.
# With more CPU available, certain steps (e.g. minimap2) will auto parallelize and complete faster, but the overall speed will not increase linearly.
# Also, with more CPU available, miniwdl will parallelize certain concurrent steps for a single assembly
# If you have a big (30+ CPU) machine, you can replace the bash for loop with a GNU parallel invocation in order to parallelize across samples
# miniwdl can submit its work across a Docker Swarm, if you set that up on your back end, but ultimately, will not have complex ways of parallelizing work across large amounts of resources. You should be switching to Cromwell (or Terra) when you get to that scale.
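The note above about running samples side by side on a larger machine can be sketched as a dry run. This version uses `xargs -P` as a stand-in for GNU parallel, since xargs is usually already installed. It is a sketch under assumptions: the script name `run_one.sh` is hypothetical, it only prints what it would do, and the `-P 2` job count is an arbitrary value for illustration. In real use, the body of the per-sample script would hold the two miniwdl commands from the for loop.

```shell
#!/bin/sh
# Dry run: hand one accession at a time to a per-sample script, two at once.
set -eu

workdir=$(mktemp -d)    # temporary directory for this demo (placeholder location)

# Hypothetical per-sample script. It only echoes here; the real version would
# run the fetch_sra_to_bam and assemble_refbased steps for the accession in $1.
cat > "$workdir/run_one.sh" <<'EOF'
#!/bin/sh
srr=$1
echo "would fetch and assemble $srr"
EOF

# -n 1 passes one accession per invocation.
# -P 2 keeps up to two invocations running at the same time (arbitrary choice).
printf '%s\n' SRR12542859 SRR12542860 SRR12542861 |
    xargs -n 1 -P 2 sh "$workdir/run_one.sh" > "$workdir/dry_run.log"

# Jobs can finish in any order, so sort the log before reading it.
sort "$workdir/dry_run.log"
```

Because miniwdl already uses several cores within a single assembly, the number of simultaneous samples should stay well below the number of CPUs on the machine.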
Session 7 - QC and Data Sharing: GISAID
The GISAID Initiative promotes the rapid sharing of data from all influenza viruses and the coronavirus causing COVID-19. This includes genetic sequence and related clinical and epidemiological data associated with human viruses, and geographical as well as species-specific data associated with avian and other animal viruses, to help researchers understand how viruses evolve and spread during epidemics and pandemics.
GISAID does so by overcoming disincentive hurdles and restrictions, which discouraged or prevented sharing of virological data prior to formal publication.
The Initiative ensures that open access to data in GISAID is provided free-of-charge to all individuals that agreed to identify themselves and agreed to uphold the GISAID sharing mechanism governed through its Database Access Agreement.
All bona fide users with GISAID access credentials agreed to the basic premise of upholding a scientific etiquette, by acknowledging the Originating laboratories providing the specimens, and the Submitting laboratories generating sequence and other metadata, ensuring fair exploitation of results derived from the data, and that all users agree that no restrictions shall be attached to data submitted to GISAID, to promote collaboration among researchers on the basis of open sharing of data and respect for all rights and interests.
Session 8 - QC and Data Sharing: NCBI SRA
Session 9 - QC and Data Sharing: NCBI Genbank