RELEASE NOTES FOR SLURM VERSION 2.2
1 December 2010

IMPORTANT NOTE:
If using the slurmdbd (SLURM DataBase Daemon) you must update this first.
The 2.2 slurmdbd will work with SLURM daemons of version 2.1.3 and above.
You will not need to update all clusters at the same time, but it is very
important to update the slurmdbd first and have it running before updating
any other clusters making use of it.  No real harm will come from updating
your systems before the slurmdbd, but they will not talk to each other
until you do.

Also, at least the first time running the slurmdbd you need to make sure your
my.cnf file has innodb_buffer_pool_size equal to at least 64M.  You can
accomplish this by adding the line "innodb_buffer_pool_size=64M" under the
[mysqld] reference and restarting the mysqld.  This is needed when converting
large tables over to the new database schema.

SLURM can be upgraded from version 2.1 to version 2.2 without loss of jobs or
other state information.

HIGHLIGHTS
==========
* Slurmctld restart/reconfiguration operations have been altered.
  NOTE: There will be no change in behavior unless partition configuration or
  node Features/Weight are altered using the scontrol command to differ from
  the contents of the slurm.conf configuration file.
  Current partition state information plus node Feature and Weight state
  information are preserved after slurmctld receives a SIGHUP signal or is
  restarted with the -R option.  Partition plus node information (except node
  State and Reason) is recreated from the slurm.conf file after executing
  "scontrol reconfig" or restarting slurmctld *without* the -R option.

  OPERATION            ACTION
  slurmctld -R         Recover all job, node and partition state
  slurmctld            Recover job state, recreate node and partition state
  slurmctld -c         Recover no jobs, recreate node and partition state
  SIGHUP to slurmctld  Preserve all job, node and partition state
  scontrol reconfig    Preserve job state, recreate node and partition state

  The old logic preserved node Feature plus partition state after "slurmctld"
  or "scontrol reconfig" rather than recreating it from slurm.conf.  Node
  Weight was formerly always recreated from slurm.conf.
* SLURM commands (squeue, sinfo, sview, etc.) can now operate between
  clusters.  Jobs can also be submitted with sbatch to other cluster(s), with
  the job routed to the cluster expected to initiate the job first.
* Accounting through the SlurmDBD with the MySQL plugin can now support a
  default account and wckey per cluster.

CONFIGURATION FILE CHANGES (see "man slurm.conf" for details)
=============================================================
* A hash of the slurm.conf running on each node in the cluster is sent when
  registering with the slurmctld so it can verify that the slurm.conf is the
  same as the one the slurmctld is running.  If not, an error message is
  displayed.  To silence this message add NO_CONF_HASH to DebugFlags in your
  slurm.conf.
* Added VSizeFactor to enforce virtual memory limits for jobs and job steps
  as a percentage of their real memory allocation.
* Added new option for SelectTypeParameters of CR_ONE_TASK_PER_CORE.  This
  option will allocate one task per core by default.  Without this option,
  by default one task will be allocated per thread on nodes with more than
  one ThreadsPerCore configured (i.e. no change in behavior without this
  option).
* Added new configuration parameters GroupUpdateForce and GroupUpdateTime.
  These control when slurmctld updates its information of which users are in
  the groups allowed to use partitions.  NOTE: There is no change in the
  default behavior.
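  For illustration, a slurm.conf fragment using these parameters might look
  like the following (the values shown are arbitrary examples, not defaults):

     GroupUpdateForce=1
     GroupUpdateTime=600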
* Added new configuration parameters SlurmSchedLogFile and SlurmSchedLogLevel
  to support writing scheduling events to a separate log file.
* Added new configuration parameter JobSubmitPlugins which provides a
  mechanism to set default job parameters or perform other site-configurable
  actions at job submit time.  Site-specific job submission plugins may be
  written in either C or LUA.
* MaxJobCount changed from a 16-bit to a 32-bit field.  The default
  MaxJobCount was changed from 5,000 to 10,000.
* Added support for a PropagatePrioProcess configuration parameter value of 2
  to restrict spawned task nice values to that of the slurmd daemon plus 1.
  This ensures that the slurmd daemon always has a higher scheduling priority
  than spawned tasks.  Also added support in slurmctld, slurmd and slurmdbd
  for an option of "-n <value>" to reset the daemon's nice value.
* Support has been added for the allocation of generic resources (GRES).  A
  new configuration parameter, GresPlugins, has been added along with a
  node-specific parameter, Gres.  There is also a gres.conf file to be
  configured on each node.  For more information, see the web page
  https://computing.llnl.gov/linux/slurm/gang_scheduling.html
  Support for enforcement of these allocations using Linux CGroup will be
  provided in a later release.
* Added support for new partition states of DRAIN (run queued jobs, but
  accept no new jobs) and INACTIVE (do not accept or run any more jobs) and a
  new partition option of "Alternate" (alternate partition to use for jobs
  submitted to partitions that are currently in a state of DRAIN or
  INACTIVE).
* Added the ability to configure PreemptMode on a per-partition or per-QOS
  basis.
* Modified the meaning of InactiveLimit slightly.  It will now cancel the job
  allocation created using the salloc or srun command if those commands cease
  responding for the InactiveLimit interval, regardless of any running job
  steps.  This parameter will no longer affect jobs spawned using sbatch.
* Added SchedulerParameters option of bf_window to control how far into the
  future the backfill scheduler will look when considering jobs to start.
  The default value is one day.
* Added the ability to specify a range of ports in the SlurmctldPort
  parameter for better handling of high bursts of RPCs
  (e.g. "SlurmctldPort=1234-1237").

COMMAND CHANGES (see man pages for details)
===========================================
* sinfo -R now has the user and timestamp in separate fields from the reason.
* Job submission commands (salloc, sbatch and srun) have a new option,
  --time-min, which permits the job's time limit to be reduced, down to the
  specified minimum value, to the extent required to start the job early
  through backfill scheduling.
* scontrol now has the ability to change a job step's time limit.
* scontrol now has the ability to shrink a job's size.  Use a command of
  "scontrol update JobId=# NumNodes=#" or
  "scontrol update JobId=# NodeList=<names>".  This command generates a
  script to be executed in order to reset SLURM environment variables for
  proper execution of subsequent job steps.
* We have given Operators, Administrators, and bank account Coordinators (as
  defined in the SLURM database) the ability to invoke commands that
  view/modify user jobs and reservations.  Previously, one had to be root to
  invoke "scontrol update JobId", for example.  In addition, Administrators
  have the ability to view/modify node and partition info without having to
  become root.  For more details, see the AUTHORIZATION section of the man
  pages for the following commands: scontrol, scancel and sbcast.
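  For example, with this change an account coordinator could adjust a job
  belonging to one of the account's users (the job ID and limit below are
  purely illustrative):

     $ scontrol update JobId=1234 TimeLimit=30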
* Users can hold and release their own jobs.  Submit in a held state using
  the srun or sbatch --hold (or -H) option.  Hold after submission using the
  command "scontrol hold <jobid>".  Release with "scontrol release <jobid>".
  Users cannot release jobs held by a system administrator unless the
  administrator uses the command "scontrol uhold <jobid>" ("uhold" for
  "user hold").
* Added support for a slurmctld and slurmd option of "-n <value>" to reset
  the daemon's nice value.
* srun's --core option has been removed.  Use the SPANK "Core" plugin from
  http://code.google.com/p/slurm-spank-plugins/ for continued support.
* Added salloc and sbatch option --wait-for-nodes.  If set non-zero, job
  initiation will be delayed until all allocated nodes have booted.  Salloc
  will log the delay with the messages "Waiting for nodes to boot" and
  "Nodes are ready for use".
* Added scontrol "wait_job <jobid>" option to wait for nodes to boot as
  needed.  Useful for batch jobs (in Prolog, PrologSlurmctld or the script)
  if powering down idle nodes.
* Modified sview to display database configuration and add/remove visible
  tabs.
* Modified sview to save its default configuration in the .slurm/sviewrc
  file.  Default settings can be set by using the menu
  Options->Set Default Settings or by typing Ctrl-S.
* Modified the select/cons_res plugin so that if MaxMemPerCPU is configured
  and a job specifies its memory requirement, then more CPUs than requested
  will automatically be allocated to the job in order to honor the
  MaxMemPerCPU parameter.

BLUEGENE SPECIFIC CHANGES
=========================

OTHER CHANGES
=============
* Added support for a default account and wckey per cluster within
  accounting.
* Added support for several new trigger types: SlurmDBD failure/restart,
  Database failure/restart, Slurmctld failure/restart.
* Support has been added for TotalView to attach to a subset of launched
  tasks instead of requiring that all tasks be attached.  This is the default
  behavior unless an option of "--enable-partial-attach=no" is passed to the
  configure (build) script.
* A web application (chart_stats.cgi) has been added that invokes sreport to
  retrieve from the accounting storage database a user's request for job
  usage or machine utilization statistics and charts the results to a
  browser.
* Much functionality has been added to accounting_storage/pgsql.  The plugin
  is still in a very beta state.
* SLURM's PMI library (for MPICH2) has been modified to properly execute an
  executable program stand-alone (a single MPI task launched without srun).
* The PMI library was also modified to use more socket connections for better
  scalability and to clear state between job step invocations.
* Added support for spank_get_item() to get S_STEP_ALLOC_CORES and
  S_STEP_ALLOC_MEM.  Support will remain for S_JOB_ALLOC_CORES and
  S_JOB_ALLOC_MEM.  (See the plugin sketch at the end of this section.)
* Changed the error message "Requested time limit exceeds partition limit" to
  "Requested time limit is invalid (exceeds some limit)".  The error can be
  triggered by a time limit exceeding the user/bank limit or by the time-min
  value exceeding the job's or partition's time limit.
* Added a proctrack/cgroup plugin which uses Linux control groups (aka
  cgroups) to track processes on Linux systems with this feature
  (kernel >= 2.6.24).
* Added the derived_ec (exit code) member to job_info_t.  The exit_code
  member captures the exit code of the job script (or salloc) while
  derived_ec contains the highest exit code of all the job steps.
* Added the derived exit code and derived exit string fields to the
  database's job record.  Both can be modified by the user after the job
  completes.  See job_exit_code.html.
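
The following is a minimal, illustrative SPANK plugin sketch showing one way
spank_get_item() might be used with the new S_STEP_ALLOC_CORES item.  The
plugin name and log message are arbitrary, only one item is shown, and the
item semantics should be checked against the spank.h shipped with your
installation.

  /* Illustrative SPANK plugin: log the cores allocated to each job step. */
  #include <slurm/spank.h>

  SPANK_PLUGIN(step_alloc_info, 1);

  int slurm_spank_task_init(spank_t sp, int ac, char **av)
  {
          char *cores = NULL;

          /* S_STEP_ALLOC_CORES returns the step's allocated cores as a
           * list-format string (passed back through a char **). */
          if (spank_get_item(sp, S_STEP_ALLOC_CORES, &cores) == ESPANK_SUCCESS)
                  slurm_info("step allocated cores: %s", cores);

          return ESPANK_SUCCESS;
  }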

API CHANGES
===========

Changed members of the following structs
========================================
job_info_t
        num_procs         -> num_cpus
        job_min_cpus      -> pn_min_cpus
        job_min_memory    -> pn_min_memory
        job_min_tmp_disk  -> pn_min_tmp_disk
        min_sockets       -> sockets_per_node
        min_cores         -> cores_per_socket
        min_threads       -> threads_per_core

job_desc_msg_t
        num_procs         -> min_cpus
        job_min_cpus      -> pn_min_cpus
        job_min_memory    -> pn_min_memory
        job_min_tmp_disk  -> pn_min_tmp_disk
        min_sockets       -> sockets_per_node
        min_cores         -> cores_per_socket
        min_threads       -> threads_per_core

partition_info_t
        state_up (new states added: PARTITION_DRAIN and PARTITION_INACTIVE)
        default_part      -> flags (as PART_FLAG_DEFAULT flag)
        disable_root_jobs -> flags (as PART_FLAG_NO_ROOT flag)
        hidden            -> flags (as PART_FLAG_HIDDEN flag)
        root_only         -> flags (as PART_FLAG_ROOT_ONLY flag)

slurm_step_ctx_params_t
        node_count        -> min_nodes

slurm_ctl_conf_t
        cache_groups      -> group_info (as GROUP_CACHE flag)

Added the following struct definitions
======================================
block_info_t (BlueGene-specific information)
        reason

job_info_t
        derived_ec
        gres
        max_cpus
        resize_time
        show_flags
        time_min

job_desc_msg_t
        gres
        max_cpus
        time_min
        wait_all_nodes

job_step_info_t
        gres

node_info_t
        boot_time
        gres
        reason_time
        reason_uid
        slurmd_start_time

partition_info_t
        alternate
        flags
        preempt_mode

slurm_ctl_conf_t
        gres_plugins
        group_info
        hash_val
        job_submit_plugins
        sched_logfile
        sched_log_level
        slurmctld_port_count
        vsize_factor

slurm_step_ctx_params_t
        features
        gres
        max_nodes

update_node_msg_t
        gres
        preempt_mode
        reason_uid

Changed the following enums
===========================
job_state_reason
        FAIL_BANK_ACCOUNT -> FAIL_ACCOUNT
        FAIL_QOS          /* invalid QOS */
        WAIT_QOS_THRES    /* required QOS threshold has been breached */

select_jobdata_type
        SELECT_JOBDATA_PTR   /* data-> select_jobinfo_t *jobinfo */

select_nodedata_type
        SELECT_NODEDATA_PTR  /* data-> select_nodeinfo_t *nodeinfo */

The select_type_plugin_info enum is no longer used and its contents are now
mostly #defines.

Added the following APIs
========================
slurm_checkpoint_requeue()
slurm_init_update_step_msg()
slurm_job_step_get_pids()
slurm_job_step_pids_free()
slurm_job_step_pids_response_msg_free()
slurm_job_step_stat()
slurm_job_step_stat_free()
slurm_job_step_stat_response_msg_free()
slurm_list_append()
slurm_list_count()
slurm_list_create()
slurm_list_destroy()
slurm_list_find()
slurm_list_is_empty()
slurm_list_iterator_create()
slurm_list_iterator_reset()
slurm_list_iterator_destroy()
slurm_list_next()
slurm_list_sort()
slurm_set_schedlog_level()
slurm_step_launch_fwd_wake()
slurm_update_step()

Changed the following APIs
==========================
slurm_load_block_info(): Added show_flag parameter.
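
For illustration, a minimal client sketch using the renamed job_info_t
members listed above follows.  It assumes only the existing slurm_load_jobs()
and slurm_free_job_info_msg() calls; the program name, output format and
error handling are arbitrary.

  /* List each job's ID with the renamed num_cpus and pn_min_memory members
   * of job_info_t (formerly num_procs and job_min_memory).
   * Build against the SLURM library, e.g.: gcc list_jobs.c -lslurm */
  #include <stdio.h>
  #include <slurm/slurm.h>
  #include <slurm/slurm_errno.h>

  int main(void)
  {
          job_info_msg_t *jobs = NULL;
          uint32_t i;

          if (slurm_load_jobs((time_t) 0, &jobs, SHOW_ALL) != SLURM_SUCCESS) {
                  slurm_perror("slurm_load_jobs");
                  return 1;
          }

          for (i = 0; i < jobs->record_count; i++) {
                  job_info_t *job = &jobs->job_array[i];
                  printf("job %u: num_cpus=%u pn_min_memory=%u\n",
                         job->job_id, job->num_cpus,
                         (unsigned) job->pn_min_memory);
          }

          slurm_free_job_info_msg(jobs);
          return 0;
  }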