Documentation/Storage architecture


Storage architecture

C2B2 manages a robust, scalable storage architecture that provides storage services for a number of applications, ranging from desktop file storage to high-performance computing applications. All of this data is managed on our multi-tiered Isilon system in conjunction with a large LTO4 tape library (see figure).

By using a combination of multiple storage technologies and platforms we are able to create multiple storage areas that can meet varying storage objectives for data integrity, performance and capacity. For instance, we have high-performance storage areas that provide high-speed access for cluster computing and "slow" storage systems that provide cheaper capacity. The data storage area makes use of both of these storage platforms, migrating older data into cheaper storage (and back into faster storage as it is accessed again).

We use the following technologies in our storage architecture:

  • quotas - Although we have nearly a petabyte of storage, we still need to control how that space is allocated. Quotas allow us to set storage limits per user, group, or directory; we primarily use directory-based quotas.

Hard limit - Attempts to write data beyond this limit are rejected. Any change to this limit must be approved by the Principal Investigator.

Soft limit - The soft limit will allow additional writes for the grace period (currently 7 days), up to the hard limit. When the grace period expires no additional writes will be allowed until data is deleted and the total stored data is under the soft limit.

Advisory limit - Purely informational; it warns you that you are approaching the soft and hard limits.

The soft limit can be removed, so that you have only an advisory limit and a hard limit. We recommend keeping the soft limit in place, but we can remove it if you agree to monitor your storage prudently.
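Quota-reporting commands vary by system, but standard Unix tools give a usable approximation of where you stand against a directory-based quota. A minimal sketch (the directory below is an illustrative stand-in for your real area under /ifs):

```shell
# Hypothetical directory; substitute your own area path under /ifs.
DIR=/tmp/quota-demo
mkdir -p "$DIR"
dd if=/dev/zero of="$DIR/sample.bin" bs=1024 count=64 status=none

# Report how much the tree uses -- compare this against your
# directory quota's advisory/soft/hard limits.
du -sk "$DIR"

# Free space on the underlying file system, for overall context.
df -h "$DIR"
```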


  • snapshots - Snapshots are a form of data backup that stores copies of data on the same storage system that it originates from. The primary advantage of snapshots is that they do not fully duplicate the data. Instead, they only make copies of data as it changes. This provides a point-in-time backup with minimal storage capacity overhead. On an area where data does not change rapidly (such as our home area), snapshots can be taken frequently providing short term "oops" recovery for files (i.e. the ability to quickly restore files that have been accidentally altered or deleted).
  • replication - We use replication to keep data on one storage system synchronized with data on another system. In combination with snapshots, this gives a point-in-time backup copy of replicated data. Because replicated data lives on a separate system in a separate location, this provides a solid backup of data that can survive disaster or system failure.
  • backup or tape backup - Periodically we write data to tape so that we can physically move the tapes to an offsite location, providing an even stronger option for disaster recovery. This also provides a long term archive of the data.
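On Isilon OneFS, snapshots are typically exposed through a read-only .snapshot directory, so the "oops" recovery described above is an ordinary copy. The snapshot and file names below are illustrative, and the block demonstrates the pattern against a mock layout under /tmp rather than the real system:

```shell
# On the real system a restore would look something like (names illustrative):
#   cp /ifs/home/.snapshot/daily_snap/c2b2/ac_lab/jw2632/notes.txt \
#      /ifs/home/c2b2/ac_lab/jw2632/notes.txt
#
# The same pattern, demonstrated with a mock snapshot layout:
mkdir -p /tmp/snapdemo/.snapshot/daily/work /tmp/snapdemo/work
echo "important results" > /tmp/snapdemo/.snapshot/daily/work/notes.txt

# The file was "accidentally deleted" from the live area; restore it
# by copying from the read-only snapshot back into place.
cp /tmp/snapdemo/.snapshot/daily/work/notes.txt /tmp/snapdemo/work/notes.txt
cat /tmp/snapdemo/work/notes.txt
```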

Storage areas

There are 4 main storage areas on the Isilon storage system, accessible as separate directories. This division lets us offer different combinations of storage features, such as replication, snapshots, and high-performance access.


Area descriptions:

  • home - users' home directories, used for documents and program source files. We snapshot this area 4 times daily; one snapshot from each day is kept for a week, and one from each week is kept for up to a month. The home area is regularly replicated to a separate storage system at a second location and periodically written to tape and sent to an offsite storage location. These directories are read-only from cluster nodes, and computing should not be done in them.
  • data - used for storing downloaded data and computational results only. Computing, particularly cluster computing, should not be done in these directories, although files may be written to them from the cluster nodes. Temporary files should never be written here. The data area is regularly replicated to a separate storage system at a second location and periodically written to tape and sent to an offsite storage location.
  • scratch - high-speed storage used for cluster computation. Cluster computing files (current working directories, temporary, and permanent files) should be here. It is preferable if all cluster computing I/O originates from this area so that cluster computing does not interfere with regular file system access for backup and desktop clients. After computation is finished results may be written back to the data directories. Data in scratch is not automatically deleted, but no replication or backups are made of this data; however, the storage that scratch lives on is highly reliable.
  • archive - used for archival of rarely needed data. This provides a cheap storage option designed specifically for archival of data sets that are not actively needed, but should not be deleted. This area has no active backups, but lives on a highly reliable storage system with built-in redundancy.
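The intended division of labor between scratch and data can be sketched as a simple job workflow. All paths below are illustrative stand-ins for the real /ifs areas:

```shell
# Illustrative stand-ins for /ifs/scratch/... and /ifs/data/...
SCRATCH=/tmp/demo/scratch/job42
DATA=/tmp/demo/data/results
mkdir -p "$SCRATCH" "$DATA"

# 1. Do all computation I/O in the scratch area.
cd "$SCRATCH"
echo "result: 42" > output.txt

# 2. When the job finishes, copy results into the backed-up data area...
cp output.txt "$DATA/"

# 3. ...and clean up the scratch working directory.
rm -f "$SCRATCH/output.txt"
```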


Feature matrix of the four storage areas
Storage area | Cluster access | Snapshots | Replication | Tape backup   | Permanent archive *
home         | read-only      | 4x daily  | daily       | quarterly     | annually
data         | read/write     | monthly   | daily       | semi-annually | annually
scratch      | read/write     | none      | none        | none          | none
archive      | none           | none      | none        | none          | none

* Permanent archives are written to tape and relocated to an offsite storage facility.

Directory Structure

The top level directory, /ifs, is subdivided into 4 directories representing the 4 storage areas. The next level is divided between major organizations (e.g. c2b2, cancer, columbia, external, ...). The organization level is subdivided by sub-organization/lab (e.g. ac_lab for the Andrea Califano lab, etc.). Finally, the sub-organization level is divided into user folders that match usernames (e.g. jw2632, pg, etc.).

At each level below the area level, there is a "shares" folder containing directories shared at that level. For example, /ifs/home/c2b2/shares contains home-area directories shared among all c2b2 members. The directory structure is illustrated in the figure below.

[Figure: Directory structure]
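Given this layout, a full path is assembled as area/organization/lab/user. The names below (c2b2, ac_lab, jw2632) are the examples used on this page:

```shell
# Path layout: /ifs/<area>/<organization>/<lab>/<user>
AREA=home ORG=c2b2 LAB=ac_lab USER_DIR=jw2632   # example names from this page
echo "/ifs/$AREA/$ORG/$LAB/$USER_DIR"   # -> /ifs/home/c2b2/ac_lab/jw2632
echo "/ifs/$AREA/$ORG/shares"           # organization-level shared folder
```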

Usage

Policies

Please read these notices carefully. Failure to adhere to these policies is a violation of C2B2 use guidelines and will result in the suspension of your account. In some cases, deletion of data and/or legal action may follow.

  • C2B2 storage is intended for work related activities only. Any unrelated or personal usage of the system will not be tolerated.
  • The Isilon storage system is not certified to meet HIPAA or PHI requirements for data storage systems. Identified patient data should never be stored on this system!
  • All restrictions established on this system must be adhered to, including but not limited to: quotas, network access restrictions, mounting restrictions, file/directory security restrictions, user account restrictions. Any attempt to circumvent these restrictions will be met with immediate action.
  • User accounts and credentials should never be shared in order to share data. Please contact C2B2 support if you need assistance sharing your data.
  • Users' home directories must not be readable by any user other than the owner. If you need to share a directory under your home directory, you can make your home directory executable (+x) but not readable (-r). This allows other users to traverse your home directory without listing its contents.
  • This is a shared use system. Everyone must be respectful of other users and their right to use the system. Any attempt to "hog" resources or prevent other users from fair use of the system will not be tolerated.
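The traverse-but-not-read permission described in the policy above corresponds to mode 711 on the home directory. A sketch against a temporary directory (on the real system you would run chmod on your home directory itself):

```shell
# Demonstrated on a temporary directory standing in for a home directory.
mkdir -p /tmp/homedemo/shared
chmod 711 /tmp/homedemo          # owner: rwx; others: execute (traverse) only
chmod 755 /tmp/homedemo/shared   # the subdirectory you actually share

# Others can now reach /tmp/homedemo/shared by its full path, but
# 'ls /tmp/homedemo' fails for them because the directory is not readable.
ls -ld /tmp/homedemo
```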

Usage suggestions

We suggest the following usage of the different storage areas:

  • home - These directories will not be writable from cluster nodes and computing should not be done in these directories. They are ideal for storing documents, development projects, or other files that may need to have point-in-time recovery.
  • data - Computing should also not be done in these directories, but files may be written to them from the cluster nodes. Writing temporary files here is a bad idea and will result in poor performance for everyone. This is a good location for, e.g., genomic databases, or for computational results that need to be backed up.
  • scratch - It is preferable if all cluster computing I/O originates from this area so that the cluster computing does not interfere with regular file system access for backup and desktop clients. After computation is finished results may be written back to the data directories or archive.
  • archive - Any files that will not need to be actively read or written and need reliable, but not necessarily disaster recovery tolerant archival storage, should be moved to this area. This area is low cost and provides a way to free up space in other directories.

You may wish to put symbolic links to your data, scratch and archive areas in your home directory.
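Such links can be created with ln -s. The paths below are illustrative stand-ins for the real /ifs paths, using the example names from this page:

```shell
# Stand-in layout under /tmp; substitute your real /ifs paths.
mkdir -p /tmp/ifs/data/c2b2/ac_lab/jw2632 \
         /tmp/ifs/scratch/c2b2/ac_lab/jw2632 \
         /tmp/ifs/home/c2b2/ac_lab/jw2632

# Create 'data' and 'scratch' links inside the home directory.
cd /tmp/ifs/home/c2b2/ac_lab/jw2632
ln -sfn /tmp/ifs/data/c2b2/ac_lab/jw2632    data
ln -sfn /tmp/ifs/scratch/c2b2/ac_lab/jw2632 scratch
ls -l
```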

Accessing storage

You can access your storage areas by logging into login.c2b2.columbia.edu using SSH. This will connect you to one of the pool of login nodes where all storage areas are available. The login nodes are available whether or not you are on the Columbia network.

To access files directly from your desktop you need to be on the Columbia University Medical Campus network (this can also be done from an offsite location while in a VPN session).

To mount your home directory in windows you need to go to My Computer -> Tools -> Map Network Drive.

You can map:

  • \\isilon.c2b2.columbia.edu\Department\Lab_name\username
  • \\isilon.c2b2.columbia.edu\Department_data\Lab_name\username
  • \\isilon.c2b2.columbia.edu\Department_scratch\Lab_name\username
  • \\isilon.c2b2.columbia.edu\Department_archive\Lab_name\username

If your computer is not on the ARCS Active Directory domain you need to map it using a different username (ARCS\username). Departments are, for example: c2b2, cancer, columbia, hemonc, msph, nypi, or pimri. Lab names in C2B2 usually follow the convention PI-initials_lab. Lab names in Cancer usually follow the convention Lab_PI-initials.
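The UNC path is assembled from department, lab, and username; the example names below are illustrative. For scripted mappings, Windows' net use command accepts the same path (shown only as a comment, since it needs a Windows machine on the network):

```shell
# Build the UNC path from its parts (example names; substitute your own).
DEPT=c2b2 LAB=ac_lab USERNAME=jw2632
UNC="\\\\isilon.c2b2.columbia.edu\\${DEPT}\\${LAB}\\${USERNAME}"
echo "$UNC"
# From a Windows command prompt, the same mapping can be scripted as, e.g.:
#   net use Z: \\isilon.c2b2.columbia.edu\c2b2\ac_lab\jw2632 /user:ARCS\jw2632
```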

To mount your home directory on a mac, open Finder and select Go -> Connect to server

On a mac you map with:

  • smb://isilon.c2b2.columbia.edu/Department/Lab_name/username
  • smb://isilon.c2b2.columbia.edu/Department_data/Lab_name/username
  • smb://isilon.c2b2.columbia.edu/Department_scratch/Lab_name/username
  • smb://isilon.c2b2.columbia.edu/Department_archive/Lab_name/username

If your computer is not on the ARCS Active Directory domain you need to map it using a different username (ARCS/username). Departments are, for example: c2b2, cancer, columbia, hemonc, msph, nypi, or pimri. Lab names in C2B2 usually follow the convention PI-initials_lab. Lab names in Cancer usually follow the convention Lab_PI-initials.
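The SMB URL follows the same department/lab/username pattern; the names below are illustrative. From Terminal, macOS can open the same share with the open command (shown as a comment, since it needs the real server):

```shell
# Build the SMB URL from its parts (example names; substitute your own).
DEPT=c2b2 LAB=ac_lab USERNAME=jw2632
URL="smb://isilon.c2b2.columbia.edu/${DEPT}_data/${LAB}/${USERNAME}"
echo "$URL"
# From Terminal, the same share can be mounted with (not run here):
#   open "smb://isilon.c2b2.columbia.edu/c2b2_data/ac_lab/jw2632"
```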

To mount your home directory on a linux workstation:

  • On the top menu bar, go to Places -> Connect to Server and type in isilon.c2b2.columbia.edu.
  • For Service Type, change it to Windows Share.
  • Server - isilon.c2b2.columbia.edu
  • Share - Department
  • Folder - (PI_lab or PI_lab_data or PI_lab_scratch)/username
  • User Name - username
  • Domain Name - ARCS
  • Check the Add Bookmark and put in a Bookmark name.
  • It will prompt you for a password and then mount the drive.

In the future, you will be able to connect to the share by going to Places and clicking on the Bookmark.
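The same share can also be mounted from the Linux command line instead of the GUI dialog. The example names below are illustrative, and the mount commands are shown only as comments since they need the real server; gio mount uses GNOME's gvfs, while mount -t cifs is a kernel-level mount:

```shell
# Build the SMB URL from its parts (example names; substitute your own).
DEPT=c2b2 LAB=ac_lab USERNAME=jw2632
URL="smb://isilon.c2b2.columbia.edu/${DEPT}/${LAB}/${USERNAME}"
echo "$URL"
# Command-line alternatives to the GUI dialog (not run here):
#   gio mount "$URL"                                  # GNOME gvfs mount
#   sudo mount -t cifs //isilon.c2b2.columbia.edu/c2b2 /mnt/isilon \
#        -o username=jw2632,domain=ARCS               # kernel CIFS mount
```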
