• The Mars Pathfinder probe lands on Mars on July 4th, 1997
• After a few days the probe experiences continuous
system resets as a result of a detected critical (timing)
error
Software Architecture
• Cyclic Scheduler @ 8 Hz
• The 1553 bus is controlled by two tasks:
– Bus Scheduler: bc_sched computes the bus schedule for the
next cycle by planning transactions on the bus (highest priority)
– Bus distribution: bc_dist collects the data transmitted on the bus
and distributes them to the interested parties (third priority level)
– A task controlling entry and landing is at the second priority level; there are
other tasks and idle time
• bc_sched must complete before the end of the cycle to set up the
transmission sequence for the upcoming cycle.
– In reality bc_sched and bc_dist must not overlap
[Figure: timeline of bc_sched, bc_dist and other tasks ordered by priority, with the active bus period and the instants t1, t2, t3]
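The timing rule above can be sketched in a few lines. This is only an illustration, not the flight code: the function name check_cycle and the millisecond offsets passed to it are hypothetical, and the only figure taken from the slides is the 8 Hz cycle rate.

```python
CYCLE_HZ = 8                  # from the slides: cyclic scheduler at 8 Hz
CYCLE_MS = 1000 / CYCLE_HZ    # length of one cycle in milliseconds


def check_cycle(dist_finish_ms, sched_start_ms):
    """Decide what bc_sched would do when it is activated (hypothetical helper).

    dist_finish_ms: offset within the cycle at which bc_dist finished,
                    or None if it is still running (assumed value)
    sched_start_ms: offset within the cycle at which bc_sched wakes up
                    (assumed value)
    """
    if not 0 <= sched_start_ms <= CYCLE_MS:
        raise ValueError("bc_sched must be activated inside the cycle")
    # bc_dist and bc_sched must not overlap: bc_dist has to be done first.
    if dist_finish_ms is None or dist_finish_ms > sched_start_ms:
        return "reset"
    return "ok"


print(CYCLE_MS)                    # cycle length derived from 8 Hz
print(check_cycle(60.0, 100.0))    # bc_dist done before bc_sched starts
print(check_cycle(None, 100.0))    # bc_dist still running: timing error
```

At 8 Hz each cycle lasts 125 ms, so bc_dist has to finish inside that window before bc_sched is activated.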
The problem
• The select mechanism creates a mutual exclusion semaphore to
protect the “wait list” of file descriptors
• The ASI/MET task had called select, which had called pipeIoctl(),
which had called selNodeAdd(), which was in the process of giving the
mutex semaphore. The ASI/MET task was preempted and semGive()
was not completed.
• Several medium priority tasks ran until the bc_dist task was activated.
The bc_dist task attempted to send the newest ASI/MET data via the
IPC mechanism which called pipeWrite(). pipeWrite() blocked, taking
the mutex semaphore. More of the medium priority tasks ran, still not
allowing the ASI/MET task to run, until the bc_sched task was
awakened.
• At that point, the bc_sched task determined that the bc_dist task had
not completed its cycle (a hard deadline in the system) and declared
the error that initiated the reset.
• ASI/MET acquires control of the bus (shared resource)
• Preemption of bc_dist
• Lock attempted on the resource
• bc_sched is activated, bc_dist is in execution after the deadline
• bc_sched detects the timing error of bc_dist and resets the system
The Solution
• After debugging on the Pathfinder replica at JPL,
engineers identify the cause of the malfunction as a
priority inversion problem.
• Priority Inheritance was disabled on pipe semaphores
• The problem did not show up during testing, since the
schedule was never tested using the final version of
the software (where medium priority tasks had higher
load)
• The on-board software was updated from Earth and
semaphore parameters (global variables in
selectLib()) were changed
• The system was tested for possible consequences on
system performance or other possible anomalies but
everything was OK
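The inversion and its fix can be reproduced with a toy tick-based simulation. This is a minimal sketch under stated assumptions, not the VxWorks implementation: the Task class, the simulate function, the priorities, release times, scripts and the deadline of 12 ticks are all hypothetical values chosen to mirror the sequence described above (a low priority ASI/MET task holding a mutex, a medium priority load, and a high priority bc_dist that needs the same mutex).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    prio: int                   # larger number = higher priority
    release: int                # tick at which the task becomes ready
    script: List[str]           # "lock", "unlock" or "run", one tick each
    pc: int = 0                 # index of the next operation
    finish: Optional[int] = None


def simulate(tasks, inherit, horizon=100):
    """Fixed-priority preemptive scheduling with one shared mutex."""
    owner = None
    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.finish is None]

        def is_blocked(t):
            # A task is blocked if it wants the mutex and another task holds it.
            return (t.script[t.pc] == "lock"
                    and owner is not None and owner is not t)

        blocked = [t for t in ready if is_blocked(t)]
        runnable = [t for t in ready if not is_blocked(t)]
        if not runnable:
            continue            # idle tick

        def effective(t):
            # Priority inheritance: the owner runs at the priority of the
            # highest-priority task waiting for the mutex.
            if inherit and t is owner and blocked:
                return max(t.prio, max(b.prio for b in blocked))
            return t.prio

        current = max(runnable, key=effective)
        op = current.script[current.pc]
        if op == "lock":
            owner = current
        elif op == "unlock":
            owner = None
        current.pc += 1
        if current.pc == len(current.script):
            current.finish = tick + 1
    return {t.name: t.finish for t in tasks}


def make_tasks():
    # Hypothetical workload mirroring the scenario in the slides.
    return [
        Task("asi_met", 1, 0, ["lock", "run", "run", "unlock", "run"]),
        Task("medium", 2, 2, ["run"] * 10),
        Task("bc_dist", 3, 3, ["run", "lock", "run", "unlock"]),
    ]


DEADLINE = 12                   # assumed tick at which bc_sched checks bc_dist

without_pi = simulate(make_tasks(), inherit=False)
with_pi = simulate(make_tasks(), inherit=True)
print("no inheritance:", without_pi["bc_dist"], "deadline:", DEADLINE)
print("inheritance:   ", with_pi["bc_dist"], "deadline:", DEADLINE)
```

In this toy workload bc_dist finishes at tick 18 without inheritance, after the assumed deadline. The medium priority task keeps the mutex owner off the CPU. With inheritance the owner is boosted for as long as bc_dist waits, releases the mutex quickly, and bc_dist finishes at tick 9.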
And it goes on and on, with diagrams and copy-pastes from other documents...
Spot the differences.
Unfortunately, the author forgot to cite all of their sources.
http://inst.eecs.berkeley.edu/~ee249/fa07/RTOS_Sched.pdf
What happened
• The Mars Pathfinder probe lands on Mars on July 4th, 1997
• After a few days the probe experiences continuous
system resets as a result of a detected critical (timing)
error
Software Architecture
• Cyclic Scheduler @ 8 Hz
• The 1553 bus is controlled by two tasks:
– Bus Scheduler: bc_sched computes the bus schedule for the
next cycle by planning transactions on the bus (highest priority)
– Bus distribution: bc_dist collects the data transmitted on the bus
and distributes them to the interested parties (third priority level)
– A task controlling entry and landing is at the second priority level; there are
other tasks and idle time
• bc_sched must complete before the end of the cycle to set up the
transmission sequence for the upcoming cycle.
– In reality bc_sched and bc_dist must not overlap
[Figure: timeline of bc_sched, bc_dist and other tasks ordered by priority, with the active bus period and the instants t1, t2, t3]
The problem
• The select mechanism creates a mutual exclusion semaphore to
protect the “wait list” of file descriptors
• The ASI/MET task had called select, which had called pipeIoctl(),
which had called selNodeAdd(), which was in the process of giving the
mutex semaphore. The ASI/MET task was preempted and semGive()
was not completed.
• Several medium priority tasks ran until the bc_dist task was activated.
The bc_dist task attempted to send the newest ASI/MET data via the
IPC mechanism which called pipeWrite(). pipeWrite() blocked, taking
the mutex semaphore. More of the medium priority tasks ran, still not
allowing the ASI/MET task to run, until the bc_sched task was
awakened.
• At that point, the bc_sched task determined that the bc_dist task had
not completed its cycle (a hard deadline in the system) and declared
the error that initiated the reset.
• ASI/MET acquires control of the bus (shared resource)
• Preemption of bc_dist
• Lock attempted on the resource
• bc_sched is activated, bc_dist is in execution after the deadline
• bc_sched detects the timing error of bc_dist and resets the system
The Solution
• After debugging on the Pathfinder replica at JPL,
engineers identify the cause of the malfunction as a
priority inversion problem.
• Priority Inheritance was disabled on pipe semaphores
• The problem did not show up during testing, since the
schedule was never tested using the final version of
the software (where medium priority tasks had higher
load)
• The on-board software was updated from Earth and
semaphore parameters (global variables in
selectLib()) were changed
• The system was tested for possible consequences on
system performance or other possible anomalies but
everything was OK
And it goes on and on, with diagrams and copy-pastes from other documents...
http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Authoritative_Account.html
http://feanor.sssup.it/~pj/rtos-arezzo/2005/mars_explorer.pdf
http://www.mvps.org/st-software/Movie_Collection/images/7775f.jpg