Nice tips for those who use Airflow, worth reading
https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
#big_data
Medium
Airflow: Lesser Known Tips, Tricks, and Best Practises
Lesser known Tips, Tricks, and Best Practises to use Apache Airflow and develop DAGs like a Pro
Runtime type checking
By default, function annotations do not affect how your code runs; they merely document intent and let linters raise errors.
However, you can enforce type checking at runtime with tools like enforce. This can help in debugging, since there are many cases where static type hints alone don't catch the problem.
import enforce
from typing import List

@enforce.runtime_validation
def foo(text: str) -> None:
    print(text)

foo('Hi')  # ok
foo(5)     # fails

@enforce.runtime_validation
def any2(x: List[bool]) -> bool:
    return any(x)

any([False, False, True, False])   # True
any2([False, False, True, False])  # True

any(['False'])   # True
any2(['False'])  # fails

any([False, None, "", 0])   # False
any2([False, None, "", 0])  # fails
#python
GitHub
GitHub - RussBaz/enforce: Python 3.5+ runtime type checking for integration testing and data validation
Python 3.5+ runtime type checking for integration testing and data validation - RussBaz/enforce
Use pathlib
pathlib is a default module in Python 3 that helps you avoid tons of confusing os.path.join calls:

from pathlib import Path

dataset_dir = 'data'
dirpath = Path('/path/to/dir/')
full_path = dirpath / dataset_dir

for filepath in full_path.iterdir():
    with filepath.open() as f:
        ...  # do stuff with f
Previously it was always tempting to use string concatenation (concise, but obviously bad); now with pathlib the code is safe, concise, and readable.

Also, pathlib.Path has a bunch of methods and properties that I previously had to google:

p.exists()
p.is_dir()
p.parts
p.with_name('sibling.png')  # only change the name, but keep the folder
p.with_suffix('.jpg')       # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()
See how easy it is to get all images recursively (without the glob module):

found_images = Path('/path/').glob('**/*.jpg')

#python
A guide to bucketing: an optimization technique that uses buckets to determine data partitioning and avoid data shuffles.
https://luminousmen.com/post/the-5-minute-guide-to-using-bucketing-in-pyspark
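Roughly what that looks like in practice (a minimal sketch, not taken from the linked post; the table and column names are invented): write both sides of a join bucketed by the join key into the same number of buckets, and Spark can later join them without shuffling either side.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, 100.0), (2, 30.0), (1, 12.5)], ["customer_id", "amount"])
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Persist both sides bucketed (and sorted) by the join key, same bucket count
(orders.write.bucketBy(8, "customer_id").sortBy("customer_id")
       .mode("overwrite").saveAsTable("orders_bucketed"))
(customers.write.bucketBy(8, "customer_id").sortBy("customer_id")
          .mode("overwrite").saveAsTable("customers_bucketed"))

# A sort-merge join on the bucketing column can now skip the shuffle:
# the physical plan should show no Exchange on either side
joined = spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"), "customer_id")
joined.explain()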
Blog | iamluminousmen
The 5-minute guide to using bucketing in Pyspark
Learn how to optimize your Apache Spark queries with bucketing in Pyspark. Discover how bucketing can enhance performance by avoiding data shuffling.
Pip constraints files
In Python, it is common practice to write all the application dependencies that are installed via pip into a separate text file called requirements.txt. It's good practice to fully specify package versions in your requirements file. And in our case, everything will be there: both the direct dependencies of our application and the dependencies of those dependencies, and so on.

But sometimes, especially on a long-lived project, it's hard to understand which dependencies were the original ones. They need to be updated on time, and you don't want to depend on packages that are outdated or no longer needed for some reason.

For example, which of the following dependencies is the original?

# requirements.txt
numpy==1.17.4
pandas==0.24.2
python-dateutil==2.8.1
pytz==2019.3
six==1.13.0

Yes, it's pandas.

One of the mechanisms for separating dependencies is implemented using another text file called constraints.txt. It looks exactly like requirements.txt:

# constraints.txt
numpy==1.17.4
python-dateutil==2.8.1
pytz==2019.3
six==1.13.0

Constraints files differ from requirements files in one key way: putting a package in the constraints file does not cause the package to be installed, whereas a requirements file will install all packages listed. Constraints files are simply requirements files that control which version of a package will be installed but provide no control over the actual installation.

To use this file, you can reference it from the requirements.txt file:

# requirements.txt
-c constraints.txt
pandas==0.24.2

or pass it on the command line:

pip install -r requirements.txt -c constraints.txt

Either way, all packages from requirements.txt are installed, with constraints.txt pinning the versions.

#python
12factor.net
The Twelve-Factor App
A methodology for building modern, scalable, maintainable software-as-a-service apps.
Go test the most advanced neural network model for generating text, GPT-2. You write a phrase, and the neural network extends it into a short text.
Try it yourself: https://talktotransformer.com/
#ml
Openai
Better language models and their implications
We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question…
"In all likelihood, sorting is one of the most researched classes of algorithms. It is a fundamental task in Computer Science, both on its own and as a step in other algorithms. Efficient algorithms for sorting and searching are now taught in core undergraduate classes. Are they at their best, or is there more blood to squeeze from that stone? This talk will explore a few less known – but more allegro! – variants of classic sorting algorithms. And as they say, the road matters more than the destination. Along the way, we’ll encounter many wondrous surprises and we’ll learn how to cope with the puzzling behavior of modern complex architectures."
If you know who Andrei Alexandrescu is, then it's recommended 👌
https://youtu.be/FJJTYQYB1JQ
#dev
YouTube
Sorting Algorithms: Speed Is Found In The Minds of People - Andrei Alexandrescu - CppCon 2019
http://CppCon.org
Discussion & Comments: https://www.reddit.com/r/cpp/
Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2019
DataFrames are the wave of the future in the Spark world, so let's push your PySpark SQL knowledge by working through the various join types.
https://luminousmen.com/post/introduction-to-pyspark-join-types
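As a warm-up, here is a tiny sketch of the join types the post covers (the DataFrames and column names below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types-demo").getOrCreate()

emp = spark.createDataFrame(
    [(1, "Ann", 10), (2, "Bob", 20), (3, "Eve", 99)],
    ["emp_id", "name", "dept_id"])
dept = spark.createDataFrame(
    [(10, "Sales"), (20, "Engineering"), (30, "HR")],
    ["dept_id", "dept_name"])

emp.join(dept, "dept_id", "inner").show()       # only rows with matching dept_id
emp.join(dept, "dept_id", "left").show()        # keep all employees
emp.join(dept, "dept_id", "full_outer").show()  # keep everything from both sides
emp.join(dept, "dept_id", "left_semi").show()   # employees that do have a department
emp.join(dept, "dept_id", "left_anti").show()   # employees with no matching department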
Blog | iamluminousmen
Introduction to Pyspark join types
DataFrames and Spark SQL API are the waves of the future in the Spark world. Here, I will push your Pyspark SQL knowledge into using different types of joins
Your top skill
The most important skill that will keep you relevant is the ability to solve problems.
Not the list of programming languages you know, not the ability to close tickets, not knowledge of a bazillion algorithms, not Scrum Master certificates.
It is the ability to take a real problem and solve it yourself that is the main skill of a professional. Solving it yourself does not mean solving it alone; it means being able to find the necessary resources and people, set tasks, control the result, and take responsibility for it.
Such people will always be needed, in any field and at any age.
#dev #soft_skills
Continuous integration and continuous delivery are like vectors that point in the same direction but have different magnitudes. The goal of both practices is the same: to make software development and releases more reliable, and to speed them up. Let me tell you how
https://luminousmen.com/post/continuous-Integration-continuous-delivery
Blog | iamluminousmen
Continuous Integration & Delivery main ideas
The goal of the CI/CD processes is to increase reliability and speed up software development while maintaining quality
Stop Installing Tensorflow using pip for performance sake!
There are two pretty good reasons why you should install Tensorflow using conda instead of pip. For those of you not in the know, conda is an open source, cross-platform package and environment management system.
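For reference, the conda route looks roughly like this (the environment name is arbitrary; the article's argument is that the conda packages are built against Intel MKL, which is where the speedup comes from):

conda create -n tf_env tensorflow        # CPU build
conda activate tf_env
# or, for a CUDA-enabled build:
# conda create -n tf_env tensorflow-gpu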
https://towardsdatascience.com/stop-installing-tensorflow-using-pip-for-performance-sake-5854f9d9eb0c
#python
Medium
Stop Installing Tensorflow using pip for performance sake!
Get 8X the speed boost with the conda installation compared to the pip installation.