Mypy stubs, i.e., type information, for numpy, pandas and matplotlib for your #ds #python projects.
Lots of functions are already typed, but a lot is still missing (numpy and pandas are huge libraries).
https://github.com/predictive-analytics-lab/data-science-types
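A minimal sketch of what the stubs buy you, assuming the package is installed alongside mypy (the helper function and column names here are made up for illustration):

```python
# Illustrative only: with data-science-types installed, mypy understands
# pd.Series / pd.DataFrame annotations instead of treating pandas as Any.
import pandas as pd


def scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to the [0, 1] range."""
    return (column - column.min()) / (column.max() - column.min())


df = pd.DataFrame({"age": [21, 35, 58]})
df["age_scaled"] = scale(df["age"])  # mypy would flag scale("age") as a type error
```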
Great Expectations: Always know what to expect from your data.
Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Software developers have long known that testing and documentation are essential for managing complex codebases. Great Expectations brings the same confidence, integrity, and acceleration to data science and data engineering teams.
See Down with Pipeline Debt! for an introduction to the philosophy of pipeline testing: https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a
Key features:
- Expectations are assertions for data: the workhorse abstraction in Great Expectations, covering all kinds of common data issues
- Batteries-included data validation
- Tests are docs and docs are tests: many data teams struggle to maintain up-to-date data documentation. Great Expectations solves this problem by rendering Expectations directly into clean, human-readable documentation
- Automated data profiling: wouldn't it be great if your tests could write themselves? Run your data through one of Great Expectations' data profilers and it will automatically generate Expectations and data documentation
- Pluggable and extensible
https://github.com/great-expectations/great_expectations
#python #ds #docops
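As a rough sketch of what Expectations look like in code, assuming the classic pandas-backed API (newer releases organize this around Data Contexts, and the file and column names below are placeholders):

```python
# Minimal, illustrative use of the classic pandas-backed API.
import great_expectations as ge

df = ge.read_csv("users.csv")  # placeholder file

# Each Expectation is an assertion about the data and returns a validation result.
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

print(df.validate())  # summary of which Expectations passed or failed
```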
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
The Knowledge Repo project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using data formats and tools that make sense in these professions. It provides various data stores (and utilities to manage them) for "knowledge posts", with a particular focus on notebooks (R Markdown and Jupyter / IPython Notebook) to better promote reproducible research.
For more information about the motivation and inspiration behind this project, we encourage you to read our Medium Post: https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091
https://github.com/airbnb/knowledge-repo
#python #ds #docops
The Machine Learning Toolkit for Kubernetes
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Features:
- Kubeflow includes services to create and manage interactive Jupyter notebooks.
- Kubeflow provides a custom TensorFlow training job operator that you can use to train your ML model.
- Kubeflow supports a TensorFlow Serving container to export trained TensorFlow models to Kubernetes.
- Kubeflow Pipelines is a comprehensive solution for deploying and managing end-to-end ML workflows.
- Our development plans extend beyond TensorFlow. We're working hard to extend support for PyTorch, Apache MXNet, MPI, XGBoost, Chainer, and more. We also integrate with Istio and Ambassador for ingress, Nuclio as a fast multi-purpose serverless framework, and Pachyderm for managing your data science pipelines.
https://www.kubeflow.org/
#devops #ds
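For a feel of the Pipelines piece, here is a tiny sketch against the kfp v1 Python SDK; the container image and arguments are placeholders, not a working pipeline:

```python
# Hypothetical single-step pipeline using the kfp v1 SDK.
import kfp
from kfp import dsl


@dsl.pipeline(name="train-demo", description="Single-step training pipeline")
def train_pipeline(epochs: int = 10):
    dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",  # placeholder image
        arguments=["--epochs", epochs],
    )


# Compile to a workflow spec that can be uploaded to a Kubeflow cluster.
kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```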
Introduction to Kubeflow (YouTube): In this first episode of Kubeflow 101, we give an overview of Kubeflow → https://goo.gle/394UQu6
Kubeflow is an open-source project containing a curated set of compatible tools and frameworks specific for ML.
GitHub Action that uses machine learning to detect potentially toxic comments added to PRs and issues, so authors have a chance to edit them and keep repos a safe space.
It uses the TensorFlow.js toxicity classification model.
It currently works when comments are posted on issues and PRs, as well as when pull request reviews are submitted.
https://github.com/charliegerard/safe-space
#ds #js #github
Quality assurance for Jupyter Notebooks.
Adapter to run any code-quality tool on a Jupyter notebook. This is intended to be run as a pre-commit hook and/or during continuous integration.
Can run: flake8, mypy, black, isort, doctest.
Also works with wemake-python-styleguide!
https://github.com/nbQA-dev/nbQA
#ds #python
Machine Learning models should play by the rules, literally. Natural Intelligence is still a pretty good idea. 
Back in the old days, it was common to write rule-based systems.
Nowadays, it's much more fashionable to use machine learning instead.
We started wondering if we might have lost something in this transition. Sure, machine learning covers a lot of ground, but it is also capable of making bad decisions. We've also reached a stage of hype where folks forget that many classification problems can be handled by natural intelligence too.
This package contains scikit-learn-compatible tools that should make it easier to construct and benchmark rule-based systems designed by humans. You can also use it in combination with ML models.
This tool allows you to draw over your datasets. These drawings can later be converted to models or to preprocessing tools.
https://koaning.github.io/human-learn/
#python #ds
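A rough sketch of the rule-as-model idea, assuming the FunctionClassifier wrapper described in the project's docs (the fare-based rule and the tiny dataset are made up):

```python
# Illustrative: wrap a plain Python rule so it behaves like a scikit-learn estimator.
import numpy as np
import pandas as pd
from hulearn.classification import FunctionClassifier


def fare_based(df: pd.DataFrame, threshold: float = 10.0) -> np.ndarray:
    """Hypothetical rule: predict the positive class whenever fare exceeds a threshold."""
    return (df["fare"] > threshold).astype(int).to_numpy()


X = pd.DataFrame({"fare": [7.25, 71.28, 8.05]})
y = np.array([0, 1, 0])

clf = FunctionClassifier(fare_based, threshold=10.0)
clf.fit(X, y)
preds = clf.predict(X)  # benchmark it like any other scikit-learn model
```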
A tiny Catalyst-like experiment runner framework on top of micrograd. 
It implements the Experiment, Runner and Callback abstractions from Catalyst's core and adds extra PyTorch-like micrograd modules: MicroLoader, MicroCriterion, MicroOptimizer and MicroScheduler.
Every module is tiny, at about 100 lines of code (even the readme). However, this is enough to make Kittylyst easily extendable to any number of data sources and to multi-stage experiments, as the demo notebook shows.
Potentially useful for educational purposes.
https://github.com/Scitator/kittylyst
#python #ds
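For readers who have not used Catalyst, a stripped-down sketch of the Runner/Callback pattern such frameworks are built around looks roughly like this (class and method names are hypothetical illustrations, not Kittylyst's actual API):

```python
# Hypothetical illustration of the Runner/Callback pattern; not Kittylyst's API.
class Callback:
    def on_epoch_start(self, runner): ...
    def on_batch_end(self, runner): ...
    def on_epoch_end(self, runner): ...


class PrintLossCallback(Callback):
    def on_epoch_end(self, runner):
        print(f"epoch {runner.epoch}: loss={runner.epoch_loss:.4f}")


class Runner:
    """Owns the training loop; callbacks hook into its stages."""

    def __init__(self, model, criterion, optimizer, callbacks):
        self.model, self.criterion, self.optimizer = model, criterion, optimizer
        self.callbacks = callbacks

    def run(self, loader, num_epochs):
        for self.epoch in range(num_epochs):
            for cb in self.callbacks:
                cb.on_epoch_start(self)
            losses = []
            for features, target in loader:
                self.optimizer.zero_grad()
                loss = self.criterion(self.model(features), target)
                loss.backward()
                self.optimizer.step()
                losses.append(float(loss.data))
                for cb in self.callbacks:
                    cb.on_batch_end(self)
            self.epoch_loss = sum(losses) / len(losses)
            for cb in self.callbacks:
                cb.on_epoch_end(self)
```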
Statistical Data Validation for Pandas.
A data validation library for scientists, engineers, and analysts seeking correctness.
pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.
pandas data structures contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:
- Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
- Perform more complex statistical validation like hypothesis testing.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
- Define schema models with a class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
https://pandera.readthedocs.io/en/stable/index.html
#python #ds
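A quick sketch of what a schema looks like in practice (column names, checks and the sample frame are illustrative; the class-based schema model API is an alternative to this dictionary style):

```python
# Illustrative pandera schema and validation call.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "age": pa.Column(int, checks=pa.Check.in_range(0, 120)),
    "income": pa.Column(float, checks=pa.Check.ge(0), nullable=True),
})

df = pd.DataFrame({"age": [25, 40], "income": [30000.0, None]})
validated = schema.validate(df)  # raises a SchemaError if any check fails
```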