Training courses 2018: Linux shell scripting for high-throughput biological data processing on supercomputers
IMPORTANT DATES for this Training course:
- Deadline for applications: 20 Dec 2017 [Full]
- Course date: 16-17 Jan 2018
VENUE:
CINECA, Via dei Tizii 6b, 00185 Roma, Italy.
FEE: There is no course fee, but participants are expected to cover their own travel, meals and hotel costs (if any).
A maximum of 20 candidates will be selected based on their need for the course, as it emerges from the application form. Notifications of acceptance will be sent shortly after each application is received, until 20 participants have been reached. Priority will be given to candidates from ELIXIR-IIB member institutions (see the list at the bottom) and ELIXIR nodes.
Cancellation policy: Attendance is limited to 20 participants, and we expect many more applications. Accepted participants commit to attending the course for its whole duration. Failure to attend training sessions is disruptive and prevents other candidates from participating. A cancellation policy is therefore in place: only written cancellation requests submitted at least two weeks before the course start date are accepted.
Instructors and Organisers
- Allegra Via - ELIXIR-IIB Training Coordinator, IBPM-CNR, IT
- Loredana Le Pera - ELIXIR-IIB Training Team, IBIOM-CNR, IT
- Tiziana Castrignanò - SCAI Department, CINECA, Roma, IT
Helpers
- Tiziano Flati - ELIXIR-IIB IBIOM-CNR and SCAI Department, CINECA, Roma, IT
- Silvia Gioiosa - ELIXIR-IIB IBIOM-CNR and SCAI Department, CINECA, Roma, IT
Course Description
An unprecedented amount of biomedical data has been produced and stored in recent years. Managing such biological big data is often not feasible without the high-performance computing architectures needed to analyse and process large-scale datasets. Running high-throughput (HTP) bioinformatics data pipelines on supercomputers requires advanced Linux shell command-line and scripting skills. Most scientists working with such data either lack these skills or have acquired them through self-learning without becoming fully independent and fluent. This may affect the quality, reproducibility, and reliability of their analyses.
In this two-day course, we will introduce the Linux shell. On day one, we will show how to navigate and work with files and directories, how to combine commands to do new things, how to perform the same actions on many different files, how to filter and selectively extract data from tables, and how to find things in files. We will also show how to connect to a remote supercomputer and how to use a supercomputing environment to analyse large amounts of biological data and to run simple shell scripts and bioinformatics pipelines.
Day two will be wholly practical. Participants are invited to let us know in advance the typical file formats they deal with (e.g. fastq, tables), the typical operations they need to perform on them (e.g. filtering, sorting) and the typical programs they need to run (e.g. bwa, hisat2), so that we can prepare tailored practicals. Participants are also welcome to bring one or more files they wish to work with, provided they do not exceed a given size.
Target audience
This course is aimed at scientists at any stage of their career who work with big data files and/or large numbers of files, and need to process and analyse their data on local or remote supercomputing machines, but lack the Linux shell command line and scripting skills necessary to perform such tasks.
Learning objectives
- Learn Linux commands to navigate and work with files and directories
- Learn how to combine commands using pipes
- Learn input and output redirection
- Learn how to find things in files
- Learn how to sort columns in tables
- Learn how to connect to a supercomputer
- Learn how to transfer files to and from a supercomputer
- Learn how to run simple programs on a supercomputer
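To give a flavour of the first objectives, here is a minimal sketch of pipes, redirection, searching and sorting. It is not taken from the course material: the file names and the three-column table (gene, chromosome, read count) are made up for illustration, and everything is written to a temporary directory.

```shell
set -eu

# Scratch directory so that no existing file is overwritten.
workdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical tab-separated table: gene, chromosome, read count.
printf 'geneA\tchr1\t120\ngeneB\tchr2\t15\ngeneC\tchr1\t980\ngeneD\tchrX\t47\n' \
    > "$workdir/counts.tsv"

# Finding things + output redirection: keep only the rows that mention chr1.
grep -w 'chr1' "$workdir/counts.tsv" > "$workdir/chr1.tsv"

# Pipe: sort by the third column (numeric, descending), then keep the top two rows.
sort -t "$tab" -k3,3nr "$workdir/counts.tsv" | head -n 2 > "$workdir/top2.tsv"

# Loop + input redirection: report the number of lines in every table.
for f in "$workdir"/*.tsv; do
    printf '%s\t%s\n' "$(basename "$f")" "$(wc -l < "$f")"
done
```

Each command does one small job; the pipe (`|`) and the redirections (`>` and `<`) are what connect them into a workflow.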
Learning outcomes
By the end of this course, learners will be able to independently:
- Fluently navigate and work with files and directories
- Sort a table by the values of a given column
- Find lines in a file where a given keyword appears
- Connect to a supercomputer
- Transfer files from the local computer to the remote one and vice versa
- Prepare the environment to analyse large amounts of biological data on a supercomputer
- Run single programs on a supercomputer
- Combine bioinformatics applications into pipelines on a supercomputer
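As a rough illustration of the connection and file-transfer outcomes, the sketch below prints the `ssh` and `scp` commands one would typically use. The user name, host name, directory and file names are hypothetical placeholders, not actual CINECA settings, and the script runs in dry-run mode so that nothing is contacted.

```shell
# All names below are hypothetical placeholders, not real CINECA settings.
REMOTE_USER="username"
REMOTE_HOST="cluster.example.org"
REMOTE_DIR="project/data"

# With DRY_RUN=1 the commands are only printed; set it to 0 to really run them.
DRY_RUN=1

# Helper (introduced for this sketch): print or execute the given command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Connect to the remote machine.
run ssh "$REMOTE_USER@$REMOTE_HOST"

# Transfer a local file to the remote machine.
run scp reads.fastq "$REMOTE_USER@$REMOTE_HOST:$REMOTE_DIR/"

# Transfer a result file from the remote machine to the current directory.
run scp "$REMOTE_USER@$REMOTE_HOST:$REMOTE_DIR/results.tsv" .
```

How programs are actually launched on a given supercomputer depends on the site's setup, so that step is left out of this sketch.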
Course prerequisites
The course is aimed at biologists with little or no experience in Linux shell scripting, and its goal is to enable them to use Linux commands autonomously to manage data files and run pipeline programs on supercomputers.
Application Form [FULL]
Preliminary programme
Tuesday - 16 Jan 2018
09:00 - 09:30 | Welcome, intro & expectations
09:30 - 10:30 | Lecture & Practical | Introducing the Linux Shell; Navigating files and directories; Working with files and directories
10:30 - 11:00 | Coffee break
11:00 - 13:00 | Lecture & Practical | Pipes and filters; Manipulating text files; Finding things; Loops
13:00 - 14:00 | Lunch break
14:00 - 16:00 | Lecture & Practical | Shell scripts; Running programs
16:00 - 16:30 | Coffee break
16:30 - 18:30 | Lecture & Practical | Working on supercomputers
Wednesday - 17 Jan 2018
09:00 - 18:00 | Hands-on
ELIXIR-IIB member institutions
- CNR (ELIXIR-IIB coordinator)
- CRS4
- CINECA
- Fondazione Edmund Mach, Trento
- INFN
- GARR
- Sapienza Università di Roma
- Università di Bari
- Università di Bologna
- Università di Firenze
- Università di Milano
- Università di Milano Bicocca
- Università di Padova
- Università di Parma
- Università di Roma "Tor Vergata"
- Università di Salerno
- Università della Tuscia