For any error you can say the cause lies between the monitor and the chair, and it's true, but it doesn't help fix the error in any way. To stand out today you need to bring both hard and soft skills to the table.
https://luminousmen.com/post/soft-skills-guide-for-software-engineer
#soft_skills
Often I rewrite my old articles to align them with my current understanding, and very often I find that I had misunderstood concepts or omitted important questions.
But sometimes when I rewrite, I realize that everything was correct originally and I'm the one talking bullshit. It's a funny feeling.
It's exhausting to be a perfectionist. Don't be one. But read the article:
Data Lake vs Data Warehouse
Big Data and Go. It has started.
An HDFS client written in Go. Good for scripting. Since it doesn't have to wait for the JVM to start up, it's also a lot faster than hadoop fs.
https://github.com/colinmarc/hdfs
#big_data
Forwarded from Инжиниринг Данных (Dmitry Anoshin)
Kaggle State of Machine Learning and Data Science 2020 (PDF, 14 MB)
What does the `not` operator do? It simply yields True if its argument is false, and False otherwise. It turns out it's pretty hard to determine what "true" is. When you look at the C implementation, the rule for deciding whether an object counts as true seems to be:
1. If True, then True;
2. If False, then False;
3. If None, then False;
4. Whatever __bool__ returns, as long as it's a (subclass of) bool;
5. Calling len() on the object: True if greater than 0, otherwise False;
6. If none of the above applies, then True.
An in-depth article on the `not` operator in Python from a core developer:
#python
Tall, Snarky Canadian: Unravelling `not` in Python
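Here's how those rules might look in pure Python. This is a rough sketch of my own, not CPython's actual code or the article's implementation:

# Approximation of the truth-testing rules listed above.
def is_true(obj) -> bool:
    if obj is True:
        return True
    if obj is False:
        return False
    if obj is None:
        return False
    # __bool__ is looked up on the type, not the instance.
    bool_method = getattr(type(obj), "__bool__", None)
    if bool_method is not None:
        result = bool_method(obj)
        if not isinstance(result, bool):
            raise TypeError("__bool__ should return bool")
        return result
    # Fall back to __len__: non-empty means true.
    len_method = getattr(type(obj), "__len__", None)
    if len_method is not None:
        return len_method(obj) > 0
    # No hints at all: objects are true by default.
    return True

def not_(obj) -> bool:
    # `not obj` is just the negation of the truth test.
    return False if is_true(obj) else True

print(not_([]), not_([1]), not_(None))  # True False True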
Data engineering in 2020-2021
Another view on the Data Management landscape. There are 9 mentions of SQL and 5 mentions of BI in the article. SQL is required knowledge for a data engineer, but it's by no means the only requirement nowadays.
The author sees the future of Data Management as a move towards SQL engines, outsourcing the complexity to the platforms. Unfortunately, that's probably true.
Although:
▪️In practice, engineers spend most of their time on the letter "T" in ETL (and not only using SQL). For example, Spark, the most popular data processing framework, is much more than just RDDs today.
▪️These emerging platforms cost a pile of money right now. For example, AWS was born because of the huge maintenance cost of the Oracle platform.
▪️I'm very sceptical of tools that claim "everyone can build a data product in several easy steps".
Article
Medium: Data engineering in 2020
Working with big data always involves complexities related to its size, storage, and processing. What skills are needed to deal with them?
https://luminousmen.com/post/data-challenges-in-big-data
PEP 585
Started trying out the new Python 3.9 release. I don't follow the features that much, but there are things that piss me off, like the implementation of static typing in Python.
Static typing has been built on top of the existing Python runtime incrementally over time. As a consequence, the collection hierarchy got duplicated, as an application could use the types from the typing module at the same time as the built-in ones.
This created a bit of confusion: we had two parallel type systems, not really competing with each other, but we always had to keep an eye out for that parallelism.
Well, now this is over.
Examples of types that previously had to be imported from typing are List, Dict, Set, Tuple, and Optional. Now, for the standard collections, you can just use the built-in list, dict, set, and tuple as generic types.
>>> import typing as T
>>> issubclass(list, T.List)
True
These types can also be parameterized: a parameterized type such as list[str] is a generic type with the expected type of the container elements.
PEP 585
#python
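A minimal sketch of what this looks like in practice on Python 3.9+ (the function and names are mine, not from the PEP):

# Built-in generics in annotations: no `from typing import List, Dict` needed.
def count_words(lines: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}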
uWorc - a no-code solution for orchestrating data pipelines, based on Airflow and Flink.
The engineering team at Uber described their new SQL-based answer to the steep learning curve in Data Management.
https://eng.uber.com/no-code-workflow-orchestrator/
#big_data
PEP 584
Another piece of Python 3.9 news, this time about merging dictionaries.
Python already had a few ways to merge two or more dictionaries, but there were always some issues:
▪️dict1.update(dict2) – this way you can merge only two dictionaries at once, it mutates dict1 in place, and it requires a temporary variable if you want to keep the merged dictionary separately.
▪️{**dict1, **dict2} – this unpacking method ignores the types of the mappings; it fails for dict subclasses such as defaultdict that have an incompatible __init__ method.
▪️ChainMap(dict1, dict2) – any changes to the ChainMap will modify the original dictionaries, because ChainMap instances are wrappers around the original dictionaries.
Now in Python 3.9 we have the Dictionary Union Operator ( | ). Yep, all caps.
>>> a = {'GME': 20, 'AMC': 20, 'TSLA': 1001}
>>> b = {'GME': 400}
>>> c = {'GME': 60}
>>> a | b | c
{'GME': 60, 'AMC': 20, 'TSLA': 1001}
>>> a |= b
>>> a
{'GME': 400, 'AMC': 20, 'TSLA': 1001}
This example shows how the dictionary union operator handles ordering: the items of the first dictionary come first, the second dictionary's new keys are appended after them, and for duplicate keys the rightmost value wins.
PEP 584
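To make the unpacking pitfall from the list above concrete, here's a small sketch of my own (not from the PEP):

from collections import defaultdict

d = defaultdict(list, {'a': [1]})
d['x'].append(1)            # works: missing keys get a fresh list

merged = {**d, 'b': [2]}
print(type(merged))         # <class 'dict'>, the subclass type is lost
try:
    merged['c'].append(3)   # plain dict: no default factory anymore
except KeyError as err:
    print('KeyError:', err)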
#python
Airflow 2.0
At the end of 2020, the major release of Apache Airflow 2.0 came out. Here are the most important things it brought with it:
1. New task scheduler. The Airflow Scheduler used to be a single point of failure and was not horizontally scalable. Now that has changed: you can run multiple schedulers, so Airflow should be able to continue running data pipelines without a hiccup even when a node failure takes down a scheduler. Cool, right?
2. RESTful API. Now we can do CRUD operations programmatically through the API! Plus an updated security model: all REST operations now require authorization.
3. The SubDAG operator was superseded by Task Groups (see the sketch after this list).
4. Smart Sensors. Although sensors are idle for most of their execution time, each keeps a "slot" that consumes CPU and memory resources. Smart Sensors reduce the number of those occupied slots during high-load times.
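A minimal sketch of Task Groups in the Airflow 2.x API (the DAG and task names here are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("example_task_groups", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    start = DummyOperator(task_id="start")
    with TaskGroup("transform") as transform:
        # Tasks inside the group show up as one collapsible node in the UI.
        clean = DummyOperator(task_id="clean")
        aggregate = DummyOperator(task_id="aggregate")
        clean >> aggregate
    end = DummyOperator(task_id="end")
    start >> transform >> end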
Source
#big_data
How the LinkedIn engineering team moved some of their pipelines from the Lambda architecture to the Kappa architecture. Check it out.
They faced the same problems as I described here; sadly, they didn't describe all of them in the post.
LinkedIn Engineering: From Lambda to Lambda-less: Lessons learned (co-authors: Xiang Zhang and Jingyu Zhu)
Just a terrific lightning talk about interesting features of two not-so-famous languages.
I screamed with laughter 😆
Another area of knowledge, one that sits in a slightly different plane but is directly related to the data. Management challenges tackle privacy, security, governance, and data/metadata management.
https://luminousmen.com/post/management-challenges-in-big-data
Snowflake
In my opinion, traditional storage methods and technologies face great challenges in delivering the service, simplicity, and value that a rapidly changing business needs. In addition to eliminating these limitations, Snowflake provided significant performance benefits, an easy and intuitive way for administrators and users to interact with the system, and, finally, a way to scale to levels of concurrency that would cost a huge pile of money in an MPP architecture and would be impossible with traditional approaches.
Snowflake, or SnowflakeDB, is a cloud-based SaaS database for analytical workloads and batch data processing, typically used to build a Data Warehouse in the cloud. Snowflake supports ACID transactions, different isolation levels, read consistency, and multi-version concurrency control. Its main features are storage decoupled from compute and a cloud-native focus.
There's been too much news about Snowflake lately. Everyone is covering how much of a silver bullet it can be. While I like the technology side of things, it also has its downsides. Going through the article, I'll make a summary for you:
▪️it's a proprietary SaaS solution: you can't solve any problem except through the company's support team
▪️it's a cloud-native solution; you can't deploy it on-premises, hence you don't really "own" the data
▪️it's an opinionated solution: you can't optimize it much, as there aren't many knobs you can turn
▪️it's expensive: depending on the use case, Snowflake may be more expensive than competitors like Amazon Redshift and traditional MPP solutions like Vertica
#big_data
Distributed Systems Architecture: Snowflake: The Good, The Bad and The Ugly
When you're choosing a base image for your Docker image, Alpine Linux is often recommended. Using Alpine, you're told, will make your images smaller and speed up your builds. But for Python, Alpine builds are vastly slower and the images end up bigger.
I've faced the same issues with a custom Airflow/Python container as the author describes here, and as a solution I switched to the slim Docker images. Doing fine so far :)
#python
Python⇒Speed: Using Alpine can make Python Docker builds 50× slower
JetBrains published the results of its Python Developers Survey for 2020, which questioned over 28,000 Python developers and enthusiasts from nearly 200 countries/regions in October 2020. Here are some of the notable insights:
⚡️ JavaScript is the most popular language for developers to use with Python, particularly for web developers. As the survey notes, with HTML/CSS, Bash/Shell, and SQL, "they create a stack of languages where 2 out of every 5 Python devs are using at least one of them."
⚡️ JavaScript and C/C++ are the most common main languages for those who use Python as a secondary language.
⚡️ 55% of those surveyed use Python for data analysis, making it the top use case, with 50% using it for web development.
⚡️ 94% use Python 3, with 6% using Python 2. 44% are using Python 3.8.
⚡️ Flask is the most popular web framework (46%), followed by Django at 43% and FastAPI at 12%.
⚡️ Most Pythonistas who use Flask prefer SQLAlchemy, while Django users use Django ORM.
⚡️ PostgreSQL is the most popular database amongst Python developers (45%).
⚡️ AWS is the top cloud platform (53%), followed by Google Cloud (33%).
⚡️ Linux is the most popular operating system (68%), followed by Windows (48%).
Check out the full report for more insights.
#python
JetBrains: Python Developers Survey 2020 Results
I want to share this seminar series which I've been following for a while now.
So Stanford created the MLSys seminar series, where they take a look at the frontier of machine learning systems and how machine learning changes the modern programming stack. They really have a lot of rockstar speakers and interesting conversations! Recommended 👌
https://mlsys.stanford.edu/
#ml