assayer is a Python RQ-based tool that automates the monitoring of machine learning (ML) model checkpoints in designated directories. It evaluates checkpoints offline as they are written during training, tracking model performance without manual intervention or interrupting the training run. This is particularly useful when evaluations are resource-intensive, such as large-scale language model evaluations, or when many experiments run simultaneously.
Key Features
- Concurrent Monitoring: assayer can observe multiple experiment directories simultaneously, launching parallel evaluations for all detected checkpoints.
- Scalable Evaluation: designed to handle many evaluation tasks quickly and efficiently, even under a heavy load of new checkpoints.
Usage Instructions
To get started with assayer, first ensure a Redis server is running. Then implement an evaluation function that takes only the checkpoint path as an argument:
```python
# in path/to/some_file.py:
def my_eval(checkpoint_path):
    # evaluation logic goes here...
    ...
```
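A fuller sketch of what such an evaluator might look like. The checkpoint format and metric logic below are hypothetical assumptions for illustration, not part of assayer, which only requires a callable taking the checkpoint path:

```python
import os
import pickle


def my_eval(checkpoint_path):
    """Hypothetical evaluator: assayer just calls this with the path of each
    new checkpoint; loading and metric computation are entirely up to you.
    Here we assume the checkpoint is a pickled dict of parameter lists."""
    with open(checkpoint_path, "rb") as f:
        state = pickle.load(f)
    # Toy "metric": total number of parameters across all entries.
    n_params = sum(len(v) for v in state.values())
    # A real evaluator would run a validation set here and persist the
    # metrics somewhere durable (a results file, experiment tracker, ...).
    return {"checkpoint": os.path.basename(checkpoint_path), "n_params": n_params}
```

Because the function runs inside an RQ worker, anything it needs (model code, datasets) must be importable from the worker's environment.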
To monitor a specific directory and execute evaluations as new checkpoints are added, run:
```shell
python -m assayer.watch --directory path/to/watch_dir --evaluator path/to/some_file.py:my_eval
```
Monitoring Multiple Directories
assayer excels at monitoring multiple directories. Each directory is observed continuously, with evaluations initiated as new checkpoints appear:
```shell
python -m assayer.watch --directory dir1 --evaluator path/to/eval1.py:eval_fn1
python -m assayer.watch --directory dir2 --evaluator path/to/eval2.py:eval_fn2
python -m assayer.watch --directory dir3 --evaluator path/to/eval3.py:eval_fn3
```
Configurable Parameters
Parameters such as the number of watch and evaluation workers can be adjusted to optimize performance:
```shell
python -m assayer.watch --num_eval_workers 5 --num_watch_workers 1 --directory dir1 --evaluator path/to/eval1.py:eval_fn1
```
How It Works
assayer leverages two Redis queues, `watch` and `evaluation`, to manage its operations. When the watch command is executed, a job is queued to monitor the specified directory; for each newly detected checkpoint, an evaluation job is enqueued. This setup lets multiple evaluations run concurrently, which is ideal for resource-intensive experiments.
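The directory-polling half of this design can be sketched in plain Python. This is an illustration of the idea, not assayer's actual implementation; the `.ckpt` suffix and the caller-owned `seen` set are assumptions:

```python
import os


def poll_for_new_checkpoints(directory, seen, suffix=".ckpt"):
    """Return checkpoint paths in `directory` not seen on earlier polls.

    A watch job would call this once per polling interval and enqueue one
    evaluation job on the `evaluation` queue per path returned."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and path not in seen:
            seen.add(path)
            new_paths.append(path)
    return new_paths
```

Note that because `seen` only ever grows, deleted files are simply never revisited, which matches the ignore-deletions behavior described in the FAQ.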
Example: MNIST
To illustrate its capabilities, assayer includes an example based on MNIST training. See the `examples/mnist/` directory for detailed setup and run instructions.
FAQs
- Checkpoint Monitoring Frequency: by default, watched directories are polled every 5 seconds; this is adjustable via the `--polling_interval` parameter.
- Job Monitoring: all active evaluation and watch jobs can be tracked with the `rq info` command or through a web dashboard. See the RQ documentation for more details on monitoring.
- Handling Deletions: assayer ignores deletions and only evaluates newly created checkpoints.
With its straightforward setup and robust functionality, assayer is a handy tool for anyone looking to monitor and evaluate ML model performance efficiently.