LuminousmenBlog
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
Airflow 2.0

At the end of 2020, the major release of Apache Airflow 2.0 came out. Here are the most important things it brought with it:

1. New task scheduler. The Airflow Scheduler used to be a single point of failure and was not horizontally scalable. That has changed: you can now run multiple Schedulers, so Airflow should keep running data pipelines without a hiccup even if a node failure takes down one of them. Cool, right?

2. RESTful API. We can now do CRUD operations programmatically through the API! Plus an updated security model: all REST operations now require authorization.

3. The SubDAG operator was superseded by Task Groups (see the sketch after this list).

4. Smart Sensors. Although sensors are idle for most of their execution time, each one still holds a "slot" that consumes CPU and memory. Smart Sensors consolidate that polling, reducing the number of occupied slots during high-load periods.
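To give a feel for Task Groups, here is a minimal sketch of a DAG that groups two tasks the way a SubDAG used to; the DAG id and task ids are made up, and it assumes a stock Airflow 2.x installation.

# Minimal sketch: a TaskGroup instead of a SubDAG (assumes Airflow 2.x)
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("taskgroup_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    start = DummyOperator(task_id="start")

    # The group shows up as a single collapsible node in the UI,
    # without spawning a separate DAG run the way a SubDAG did
    with TaskGroup("extract") as extract:
        DummyOperator(task_id="pull_users")
        DummyOperator(task_id="pull_orders")

    end = DummyOperator(task_id="end")
    start >> extract >> end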

Source

#big_data
How the LinkedIn engineering team moved some of their pipelines from a Lambda architecture to a Kappa architecture. Check it out

They faced the same problems I described here; sadly, they didn't cover all of them in the post.
Just a terrific lightning talk about interesting features of two not-so-famous languages.

I screamed with laughter 😆
Another area of knowledge that sits on a slightly different plane but is directly related to data. Management challenges cover privacy, security, governance, and data/metadata management.

https://luminousmen.com/post/management-challenges-in-big-data
Snowflake

In my opinion, traditional storage methods and technologies struggle to deliver the service, simplicity, and value that a rapidly changing business needs. Besides removing those limitations, Snowflake provides significant performance benefits, an easy and intuitive way for administrators and users to interact with the system, and a way to scale to levels of concurrency that would cost a huge pile of money in an MPP architecture and are simply impossible with traditional approaches.

Snowflake, or SnowflakeDB, is a cloud-based SaaS database for analytical workloads and batch data processing, typically used to build a Data Warehouse in the cloud. It supports transactions, different isolation levels, ACID guarantees, read consistency, and multi-version concurrency control. Its main distinguishing features are storage decoupled from compute and a cloud-first focus.
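To illustrate the decoupled-compute part, here is a minimal sketch using the snowflake-connector-python package; the account, credentials, and warehouse name are made up. Compute lives in "virtual warehouses" that you can create, resize, or suspend without touching the data in storage.

import snowflake.connector

# Connect to a (hypothetical) Snowflake account
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="***",
)
cur = conn.cursor()

# A virtual warehouse is a compute cluster managed separately from storage
cur.execute("CREATE WAREHOUSE IF NOT EXISTS analytics_wh WITH WAREHOUSE_SIZE = 'XSMALL'")
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'")  # scale up for a heavy batch
cur.execute("ALTER WAREHOUSE analytics_wh SUSPEND")  # stop paying for compute when idle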

There's been a lot of news about Snowflake lately. Everyone is covering how much of a silver bullet it can be. While I like the technology side of things, it also has its downsides. Going through the article, here is a summary for you:

▪️it's a proprietary SaaS solution. You can't solve problems any other way than through the company's support team
▪️it's a cloud-native solution; you can't deploy it on-premises, so you don't really "own" the data
▪️it's an opinionated solution. You can't optimize it much, as there aren't many knobs you can turn
▪️it's expensive. Depending on the use case, Snowflake may cost more than competitors like Amazon Redshift and traditional MPP solutions like Vertica

#big_data
When you're choosing a base image for your Docker image, Alpine Linux is often recommended. Using Alpine, you're told, will make your images smaller and speed up your builds. But in practice, Alpine-based Python builds are vastly slower and the images often end up bigger.

I've faced the same issues with a custom Airflow/Python container as the author describes here, and as a solution I switched to the slim Docker images. Doing fine so far :)

#python
JetBrains published the results of its Python Developers Survey for 2020. After questioning over 28,000 Python developers and fans from nearly 200 countries/regions in October 2020, here are some of the notable insights:

⚡️ JavaScript is the most popular language for developers to use with Python, particularly for web developers. As the survey notes, with HTML/CSS, Bash/Shell, and SQL, "they create a stack of languages where 2 out of every 5 Python devs are using at least one of them."
⚡️ JavaScript and C/C++ are the most common main languages for those who use Python as a secondary language.
⚡️ 55% surveyed use Python for data analysis, making it the top use case, with 50% using it for web development.
⚡️ 94% use Python 3, with 6% using Python 2. 44% are using Python 3.8.
⚡️ Flask is the most popular web framework (46%), followed by Django at 43% and FastAPI at 12%.
⚡️ Most Pythonistas who use Flask prefer SQLAlchemy, while Django users use Django ORM.
⚡️ PostgreSQL is the most popular database amongst Python developers (45%).
⚡️ AWS is the top cloud platform (53%), followed by Google Cloud (33%).
⚡️ Linux is the most popular operating system (68%), followed by Windows (48%).

Check out the full report for more insights.

#python
I want to share this seminar series which I've been following for a while now.

So Stanford created an MLSys seminar series that looks at the frontier of machine learning systems and how machine learning changes the modern programming stack. They have a lot of rockstar speakers and interesting conversations! Recommended 👌

https://mlsys.stanford.edu/

#ml
Pattern matching

For a long time, the Python community has put forward various proposals for expressing multi-branch conditions in the language (similar to the switch statement in C/C++), but none of them ever made it in. Over the past year or so, the community discussed a proposal that could solve the multi-branch problem (and more) and adopted structural pattern matching through PEP 634, 635, and 636. It will be a feature of the upcoming Python 3.10 release.

Here is an example:

match command.split():
    case ["quit"]:
        print("Goodbye!")
        quit_game()
    case ["get", obj]:
        character.get(obj, current_room)
    case ["go", direction]:
        current_room = current_room.neighbor(direction)
    case _:
        print(f"Sorry, I couldn't understand {command!r}")

To me it looks like Scala-style syntax, which is not bad.

#python
This guy shared a Data Science cheatsheet to assist with exam reviews, interview prep, etc. Worth sharing.

#ds
​​Hey everybody,

Channel is growing so let me collect all the interesting posts from this channel that I think should get more attention:

#dev
⚡️Technological degradation
⚡️Define "production-ready"
⚡️Abstraction is not OOP

#python
⚡️Use pathlib
⚡️Pip constraints files
⚡️.pth files

#soft_skills
⚡️Your top skill
⚡️Ask stupid questions
⚡️Let robots yell at people
⚡️Soft skills thoughts

#big_data
⚡️S3 vs HDFS
⚡️Famous in-memory data format
⚡️Complexity in distributed systems
⚡️Snowflake

#ml
⚡️ML system basic framework
⚡️AutoML
⚡️MLOps
⚡️Testing and validation in ML

Bigger posts I'm sharing on my small website ✌️
Perhaps not every Python developer knows this interesting property of CPython that makes newcomers go crazy:

>>> a = 255
>>> b = 255
>>> a == b
True
>>> a is b
True

The double equals operator checks that the objects' values are equal, while the is operator checks that the variables refer to the same object. Naively, a and b should be two different objects, so you would expect a is b to return False.

There is actually an optimization in CPython for small integers (-5 to 256 inclusive): these objects are created when the interpreter starts up and kept in a small internal cache. Because of this, variables with the same small value point to the same object, and the result is True.

A similar example with numbers > 256 works as expected:

>>> a = 257
>>> b = 257
>>> a == b
True
>>> a is b
False

#python
I first wanted to call the article Data Engineering terms, but then I thought that I was describing more than just terminology. We can think of those terms as architectural requirements that greatly affect the design, cost, and ultimate usefulness of the systems. Behind the marketing and noise, there are real problems that need to be tackled.

https://luminousmen.com/post/architecturally-significant-requirements
A collection of AWS-related videos, podcasts, code repositories, whitepapers, and feature releases, all in a single, easy-to-search interface.

https://awsstash.com/

#aws
Btw, this also works for strings:

>>> a = 'the'
>>> b = 'the'
>>> a == b
True
>>> a is b
True

Since strings are immutable, it makes sense for the interpreter to store a string literal only once and point all the variables to the same object. This is called string interning, and it's an internal optimization technique that CPython applies to strings that look like valid identifiers.

In Python 3.6, any string of length ≤ 20 would get interned. But Python 3.7 uses the AST optimizer, and (most) strings of up to 4096 characters are interned.

>>> a = 'the thing'
>>> b = 'the thing'
>>> a == b
True
>>> a is b
False

WTF, you may wonder. This string simply isn't a valid identifier (it contains a space), that's all. But we can make interning explicit:

>>> import sys
>>> b = sys.intern('the thing')
>>> a = sys.intern('the thing')
>>> a == b
True
>>> a is b
True

Why?

🔸 Saving memory
🔸 Fast comparisons
🔸 Fast dictionary lookups

#python
Delta Lake vs Iceberg vs Hudi vs Hive ACID

Many companies realized that while Hadoop was initially cheap to set up, it required hiring much more expensive engineers to write optimal code; otherwise the user community could easily and quickly throw an inefficient workload at the big data system, which led to problems with SLAs, scalability, and maintenance costs.

Hadoop-based approaches are in most cases still cheaper than relying mostly on Redshift/Teradata/Exadata/Netezza/Vertica, but it is obvious why Lakehouse (a new marketing term from Databricks) solutions like Hive ACID, Delta Lake, Apache Hudi, and Iceberg started to appear.

▪️Apache Spark has had great commercial success, so Databricks, the company behind it, launched Delta Lake (see the sketch after this list).

▪️Apache Hudi was designed by Uber engineers as a Data Lake layer for their internal data analysis needs; it provides fast upserts, deletes, compaction, and other functions that precisely address their pain points. On top of that, the project's community members share technical details and gradually attract potential users. Btw, Hudi has been integrated into EMR for a long time, so it is very attractive to use Hudi on AWS.

▪️Apache Iceberg's momentum appears relatively mediocre from a marketing point of view and its functionality is not that rich yet, but it is an ambitious project with high-level abstractions and a very elegant design.
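To make the Delta Lake idea concrete, here is a minimal PySpark sketch; it assumes Spark with the delta-spark package available, and the path and schema are made up. A Delta table is just Parquet files plus a transaction log, which is what gives you ACID and time travel on top of a data lake.

from pyspark.sql import SparkSession

# Enable the Delta Lake extensions on a plain Spark session
spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Writing produces Parquet files plus a _delta_log transaction log
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")

# Read the current version, or time-travel to an older one
spark.read.format("delta").load("/tmp/users_delta").show()
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users_delta").show()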

So,

Delta Lake has the best momentum

Iceberg has the best design

Hudi has awesome performance

Hive is unfortunately fading away

Source

#big_data