Contributing

We are so glad that you want to contribute to our project! Here are some guidelines to help you get started:

We are using git for version control, if you are familiar with git feel free to skip this section. The first thing you will need is a github account. If you don’t have one, you can create one at github.com. After that you can fork the repository by clicking the “Fork” button in the top right corner of the repository page. This will create a copy of the repository in your own account.

Here are some basic git instuctions:

Creating a local git repo

Github is built on top of git. Git is a popular code version control system that tracks your edits to files and whether new files are added or old ones are delelted. To create a git repository in a directory of your choosing

cd mydirectory
git init

this will create a blank git repository and you will be in the main branch.

Better yet clone this repository using:

git clone https://github.com/celalp/mnemosyne

This branch is reserved for the “production” code and things will not be added here until they are tested and ready to go. You can create a new branch by:

git branch new_branch

you can switch between branches using:

git checkout old_branch

to add new files/folder to your git repository use:

git add new_file.py

to commit use:

git commit -m "commit message"

See below for more guidelines on branching, commit pushes and pull requests.

You can configure your remote repository with:

git remote add origin https://github.com/celalp/mnemosyne

and you can push using

git push -u origin <branch name>

Please create a .gitignore file to keep the unwanted from being added and commited to the repository. You can also use the exisiting file and make changes as you see fit.

Commiting guidelines

Please make sure to write clear and concise commit messages. A good commit message should explain what changes were made and why. Do not try to make multiple unrelated changes in a single commit. Instead, break them down into smaller, logical commits, this not only makes it easier to review your changes, but also helps in tracking down issues later on. When you want to create a pull request make sure that it is related to a single feature or bug fix. If you have multiple features or bug fixes, create separate pull requests for each one. This will allow us to review and merge them independently, making the process smoother and more efficient.

Branching guidelines

When you are working on a new feature or bug fix, please create a new branch for your changes. This will help keep the main branch clean and will also avoid conflicts with other contributors.

Code Style

As you can tell this is mainly a Python project, so please follow the PEP 8 style guide for Python code. To make everyone’s life easier we are going to keep a very relaxed style guide. We are not going to enforce any specific line length, or any specific indentation style. We do however ask that you use spaces instead of tabs for indentation, and that you use 4 spaces for each level of indentation.

Find a clear but short name for your task obviously myawesomecode is not an appropritae name for a folder or python code but neither is scibertsummarizerforclinicalnotesbasedonpreviouscomments. Something like summarizer is a much better choice. While naming folders and files try and be as explicit as possible so if your code does not summarize but select sections of notes section_selector might be more suitable.

Within each folder there should be at least one python module with the same name, this module will contain the main code that does the task. This does not mean that it will contain all the code related to the task. Have a clear separtion of different kinds of things each module does and try to contain each of these in their respective module. You can make this as a CLI script that is callable with arguments (see below) or you can choose to include another file (this can be python or bash -let’s keep things standard, if you use bash please set -oe pipefail).

For processing data, and general manipulation tasks create a module called utils.py this will contain the utilities that are helper functions and classes but do not perform the main task. For example: a function that takes all the tab (\t) and converts them to new lines (\n) will be in the utils.

If this module is going to be part of the knowledge base and will be used by other modules, please make sure that you describe in detail how it works and what it does in the README.md file.

Additionally you will need to structure your data in a way that it can be stored in a normalized SQL database. This means that you will need to create a module called tables.py and that will contain the SQLAlchemy models that will be used to generate and populate the tables in the database. Some of the modules already have this files, so feel free to use them as a reference.

If you are not familiar with SQLAlchemy, please take a look at the SQLAlchemy documentation and if you have any questions about how to sructure your data in a way that it can be stored in a normalized SQL database, see here for reference and you can always reach out to us via issues.

There is no limit to how many modules you can create but one helpful rule I find is to focus on the task not on the code. Each task should get its own module that is then imported by the main module.

In addition to code your folder should also contain a README.md that is properly formatted in markdown style (like this README). This document will include:

A brief summary of what the task(s) is (are)
How they are accomplished
list of 3rd party modules
detailed description of modules
usage instructions for main classes and functions

If you are using conda you can choose to inlcude an enviroment.yaml. Please use these names in and not something else to make sure that we are all in the same page.

Code style

This is a python project so at the very least we will stick to PEP8 guides with 120 characters per line (we can change that if you’d like).

Each function/class should contain detailed docstring that has at the very least the following information

What does the function do use common sense in describing the function, if the task is simple the description can be simple
parameters and types, we can use reST style docstrings
outputs

for inputs like *args and **kwargs describe how they might be used and how they are passed to different functions inside the function.

Avoid lambdas unless the task is extremely simple, same goes for list comprehensions. There is no performance cost/benefit but a for loop is much easier to read.

For classes use CamelCase, for functions use lowercase. In classes there should be a docstring for the class as well as class methods like so:

class NewClass:
    """
    this class does something awesome
    """

    def __init__(self, input1, input2):
        """
        initiate a new instance of NewClass with some basic calculations and some other things
        param: self:, self NewClass
        param: input1: an input1
        param: input2 an input2
        type: input1: pandas DataFrame
        type: input2: bool
        return: a dict of different awesome results
        rtype: dict
        """
        pass

    def method1(self, *args, **kwargs):
        pass

Feel free to structure your code however you wish as long as it’s well documented. That said try and avoid exotic cases python inheritance cases and global variables and scoping out variables using global. Each function/class should be self contained and any input(s) it relies on should be passed during function call.

Use common sense when you are structuring your code, if you really need Subclassing go for it, if you really need mixins that’s ok too but with complexity comes side effects and convoluted code. If you think you need some of these features please feel free to reach out and we can discuss if we can have a simpler architecture.

If you like using type hints please do so, but do not feel obligated to use them. If you do use them please make sure that the types are correct and that they are used consistently throughout the codebase. Whenever applicaple at least describe what types are expected for the inputs and outputs of the functions within the docstring.

Lazy vs Eager eval

Try and write your code as lazy as possible. Nothing should be calculated/processed/edited unless that method is explicitly called. If you want method chaining that’s ok too but make sure that you really need it.

Dunder (“__”) methods and operator overloading

If you choose you can set up __str__ and __repr__ methods of your classes and subclasses. If you want to do operator overloading please have a good reason to do so and make sure that it is well documented in your code and README.

Errors and Exceptions

Please code as defensively as possible. There are a lot of built-in exceptions that you can use to catch errors that you can foresee happening like a FileNotFoundError. Feel free to create your own exceptions like so:

class NewException(Exception):
    pass

Threading and multicore processing

Currently, I think we are all using python 3.10. While python does allow multithreaded applications with the threading module it is complicated to use. You can choose to multi core processing but please provide arguments (see below) to allow user to set up the number of cores that can be used. While performing multiprocessing keep in mind that your RAM usage basically multiplies with the number of cores you are using. Be mindful and don’t crash the VM (not a big deal it just would take a bit for me to reset and everyone will be kicked out until the reset is done).

Arguments and settings

For simple CLI arguments use the argparse module. This is an extremely flexible module and you can have subparsers for different modes of analysis. Please do not use a 3rd parth module like click. There is no need to increase the number of dependencies.

If your code requires extensive parameters (it might for experimentation) you can have a json or a yaml file to keep these values as a key:value store. Make sure that this file location is NOT hardcoded but rather passed as an argument in the callable script.

Push guides

As long as you are following the guidelines above you can push to your branches as much as you want. Github tracks your git repo so if you have done multiple commits a single push will show up as multiple commits on the remote repo as well.

Pull requests

If you want to contribute to someone else’s code please create a pull request unless you are actively working with that person. The tagged person will then review the code and will approve or edit as they see fit. Save for the simplest of taskt please keep the discussion within the issues section so we all know what’s going on.

Other Issues

I am not very familiar with the idea of qualitative research. If you have any suggestions about the overall workflow or some other ideas about how to improve the project please feel free to reach out through issues.